Abstract

This article provides the first procedure for computing a fully data-dependent interval that traps
the mixing time $t_{\mathrm{mix}}$ of a finite reversible ergodic Markov chain at a prescribed confidence level. The
interval is computed from a single finite-length sample path from the Markov chain, and does not require
the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which
either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The
interval is constructed around the relaxation time $t_{\mathrm{relax}}$, which is strongly related to the mixing time,
and the width of the interval converges to zero roughly at a $\sqrt{n}$ rate, where $n$ is the length of the sample
path. Upper and lower bounds are given on the number of samples required to achieve constant-factor
multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the
chain, no procedure can achieve this accuracy level before seeing each state at least $\Omega(t_{\mathrm{relax}})$ times on
the average. Finally, future directions of research are identified.


1 Introduction 

This work tackles the challenge of constructing fully empirical bounds on the mixing time of Markov chains
based on a single sample path. Let $(X_t)_{t=1,2,\dots}$ be an irreducible, aperiodic time-homogeneous Markov chain
on a finite state space $[d] := \{1, 2, \dots, d\}$ with transition matrix $\boldsymbol{P}$. Under this assumption, the chain
converges to its unique stationary distribution $\boldsymbol{\pi} = (\pi_i)_{i=1}^d$ regardless of the initial state distribution $\boldsymbol{q}$:

$$\lim_{t \to \infty} \Pr_{\boldsymbol{q}}(X_t = i) = \lim_{t \to \infty} (\boldsymbol{q}\boldsymbol{P}^t)_i = \pi_i \quad \text{for each } i \in [d].$$

The mixing time $t_{\mathrm{mix}}$ of the Markov chain is the number of time steps required for the chain to be within a
fixed threshold of its stationary distribution:

$$t_{\mathrm{mix}} := \min\Bigl\{ t \in \mathbb{N} : \sup_{\boldsymbol{q}} \max_{A \subseteq [d]} \bigl|\Pr_{\boldsymbol{q}}(X_t \in A) - \pi(A)\bigr| \le 1/4 \Bigr\}. \tag{1}$$

Here, $\pi(A) = \sum_{i \in A} \pi_i$ is the probability assigned to set $A$ by $\boldsymbol{\pi}$, and the supremum is over all possible
initial distributions $\boldsymbol{q}$. The problem studied in this work is the construction of a non-trivial confidence
interval $C_n = C_n(X_1, X_2, \dots, X_n, \delta) \subseteq [0, \infty]$, based only on the observed sample path $(X_1, X_2, \dots, X_n)$ and
$\delta \in (0,1)$, that succeeds with probability $1 - \delta$ in trapping the value of the mixing time $t_{\mathrm{mix}}$.
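For a small chain whose transition matrix is known, the definition in Eq. (1) can be evaluated by brute force. The sketch below is only an illustration of the definition, not part of the estimation procedure studied in this paper (which has no access to the transition matrix). It uses the standard fact that the supremum over initial distributions is attained at a point mass, and that the maximum over sets equals half the $\ell_1$ distance. The function names and the example matrix are assumptions made for illustration.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 for an ergodic row-stochastic matrix P."""
    d = P.shape[0]
    # Stack the d balance equations with the normalization constraint.
    lhs = np.vstack([P.T - np.eye(d), np.ones((1, d))])
    rhs = np.zeros(d + 1)
    rhs[-1] = 1.0
    return np.linalg.lstsq(lhs, rhs, rcond=None)[0]

def mixing_time(P, threshold=0.25, t_max=10_000):
    """Smallest t >= 1 with max_i TV(P^t[i, :], pi) <= threshold, as in Eq. (1)."""
    pi = stationary_distribution(P)
    Pt = np.eye(P.shape[0])
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        # Total variation distance from each deterministic starting state.
        worst_tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if worst_tv <= threshold:
            return t
    raise RuntimeError("chain did not mix within t_max steps")

# Hypothetical two-state example.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
t_mix = mixing_time(P)
```

For this example the second eigenvalue is 0.7 and the worst-case distance after $t$ steps is $(2/3) \cdot 0.7^t$, so the loop stops at $t = 3$.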

This problem is motivated by the numerous scientific applications and machine learning tasks in which
the quantity of interest is the mean $\pi(f) = \sum_i \pi_i f(i)$ for some function $f$ of the states of a Markov chain.
This is the setting of the celebrated Markov Chain Monte Carlo (MCMC) paradigm [28], but the problem also
arises in performance prediction involving time-correlated data, as is common in reinforcement learning [42].
Observable bounds on mixing times are useful in the design and diagnostics of these methods; they yield
effective approaches to assessing the estimation quality, even when a priori knowledge of the mixing time or
correlation structure is unavailable.




Main results. We develop the first procedure for constructing non-trivial and fully empirical confidence
intervals for Markov mixing time. Consider a reversible ergodic Markov chain on $d$ states with absolute
spectral gap $\gamma_\star$ and stationary distribution minorized by $\pi_\star$. As is well-known [25, Theorems 12.3 and 12.4],

$$(t_{\mathrm{relax}} - 1) \ln 2 \;\le\; t_{\mathrm{mix}} \;\le\; t_{\mathrm{relax}} \ln \frac{4}{\pi_\star} \tag{2}$$

where $t_{\mathrm{relax}} := 1/\gamma_\star$ is the relaxation time. Hence, it suffices to estimate $\gamma_\star$ and $\pi_\star$. Our main results are
summarized as follows.


1. In Section 3.1, we show that in some problems $n = \Omega((d \log d)/\gamma_\star + 1/\pi_\star)$ observations are necessary
for any procedure to guarantee constant multiplicative accuracy in estimating $\gamma_\star$ (Theorems 1 and 2).
Essentially, in some problems every state may need to be visited about $\log(d)/\gamma_\star$ times, on average,
before an accurate estimate of the mixing time can be provided, regardless of the actual estimation
procedure used.

2. In Section 3.2, we give a point-estimator for $\gamma_\star$, and prove in Theorem 3 that it achieves multiplicative
accuracy from a single sample path of length $\tilde{O}(1/(\pi_\star \gamma_\star^3))$.¹ We also provide a point-estimator for $\pi_\star$
that requires a sample path of length $\tilde{O}(1/(\pi_\star \gamma_\star))$. This establishes the feasibility of estimating the
mixing time in this setting. However, the valid confidence intervals suggested by Theorem 3 depend on
the unknown quantities $\pi_\star$ and $\gamma_\star$. We also discuss the importance of reversibility, and some possible
extensions to non-reversible chains.

3. In Section 4, the construction of valid fully empirical confidence intervals for $\pi_\star$ and $\gamma_\star$ is considered.
First, the difficulty of the task is explained, i.e., why the standard approach of turning the finite time
confidence intervals of Theorem 3 into fully empirical ones fails. Combining several results from
perturbation theory in a novel fashion, we propose a new procedure and prove that it avoids slow
convergence (Theorem 4). We also explain how to combine the empirical confidence intervals from
Algorithm 1 with the non-empirical bounds from Theorem 3 to produce valid empirical confidence
intervals. We prove in Theorem 5 that the width of these new intervals converges to zero asymptotically
at least as fast as those from either Theorem 3 or Theorem 4.


Related work. There is a vast statistical literature on estimation in Markov chains. For instance, it is
known that under the assumptions on $(X_t)_t$ from above, the law of large numbers guarantees that the sample
mean $\pi_n(f) := \frac{1}{n} \sum_{t=1}^n f(X_t)$ converges almost surely to $\pi(f)$ [32], while the central limit theorem tells us
that as $n \to \infty$, the distribution of the deviation $\sqrt{n}(\pi_n(f) - \pi(f))$ will be normal with mean zero and
asymptotic variance $\lim_{n \to \infty} n \operatorname{Var}(\pi_n(f))$ [21].

Although these asymptotic results help us understand the limiting behavior of the sample mean over
a Markov chain, they say little about the finite-time non-asymptotic behavior, which is often needed for
the prudent evaluation of a method or even its algorithmic design [3, 11, 12, 17, 23, 26, 29, 33, 43]. To
address this need, numerous works have developed Chernoff-type bounds on $\Pr(|\pi_n(f) - \pi(f)| > \varepsilon)$, thus
providing valuable tools for non-asymptotic probabilistic analysis [16, 23, 24, 37]. These probability bounds
are larger than corresponding bounds for independent and identically distributed (iid) data due to the
temporal dependence; intuitively, for the Markov chain to yield a fresh draw $X_{t'}$ that behaves as if it were
independent of $X_t$, one must wait $\Theta(t_{\mathrm{mix}})$ time steps. Note that the bounds generally depend on
distribution-specific properties of the Markov chain (e.g., $\boldsymbol{P}$, $t_{\mathrm{mix}}$, $\gamma_\star$), which are often unknown a priori in practice.
Consequently, much effort has been put towards estimating these unknown quantities, especially in the
context of MCMC diagnostics, in order to provide data-dependent assessments of estimation accuracy [e.g.,
1, 11, 12, 15, 17, 19]. However, these approaches generally only provide asymptotic guarantees, and hence
fall short of our goal of empirical bounds that are valid with any finite-length sample path.

Learning with dependent data is another main motivation for our work. Many results from statistical
learning and empirical process theory have been extended to sufficiently fast mixing, dependent data [e.g.,
14, 20, 34, 35, 39, 40, 45], providing learnability assurances (e.g., generalization error bounds). These results
are often given in terms of mixing coefficients, which can be consistently estimated in some cases [30].
However, the convergence rates of the estimates from [30], which are needed to derive confidence bounds,
are given in terms of unknown mixing coefficients. When the data comes from a Markov chain, these mixing
coefficients can often be bounded in terms of mixing times, and hence our main results provide a way to
make them fully empirical, at least in the limited setting we study.

¹ The $\tilde{O}(\cdot)$ notation suppresses logarithmic factors.

It is possible to eliminate many of the difficulties presented above when allowed more flexible access to the
Markov chain. For example, given a sampling oracle that generates independent transitions from any given
state (akin to a “reset” device), the mixing time becomes an efficiently testable property in the sense studied
in [4, 5]. On the other hand, when one only has a circuit-based description of the transition probabilities
of a Markov chain over an exponentially-large state space, there are complexity-theoretic barriers for many
MCMC diagnostic problems [8].

2 Preliminaries 

2.1 Notations 

We denote the set of positive integers by $\mathbb{N}$, and the set of the first $d$ positive integers $\{1, 2, \dots, d\}$ by $[d]$.
The non-negative part of a real number $x$ is $[x]_+ := \max\{0, x\}$, and $\lceil x \rceil_+ := \max\{0, \lceil x \rceil\}$. We use $\ln(\cdot)$
for natural logarithm, and $\log(\cdot)$ for logarithm with an arbitrary constant base. Boldface symbols are used
for vectors and matrices (e.g., $\boldsymbol{v}$, $\boldsymbol{M}$), and their entries are referenced by subindexing (e.g., $v_i$, $M_{i,j}$). For
a vector $\boldsymbol{v}$, $\|\boldsymbol{v}\|$ denotes its Euclidean norm; for a matrix $\boldsymbol{M}$, $\|\boldsymbol{M}\|$ denotes its spectral norm. We use
$\operatorname{Diag}(\boldsymbol{v})$ to denote the diagonal matrix whose $(i,i)$-th entry is $v_i$. The probability simplex is denoted by
$\Delta^{d-1} = \{\boldsymbol{p} \in [0,1]^d : \sum_{i=1}^d p_i = 1\}$, and we regard vectors in $\Delta^{d-1}$ as row vectors.

2.2 Setting 

Let $\boldsymbol{P} \in (\Delta^{d-1})^d \subset [0,1]^{d \times d}$ be a $d \times d$ row-stochastic matrix for an ergodic (i.e., irreducible and aperiodic)
Markov chain. This implies there is a unique stationary distribution $\boldsymbol{\pi} \in \Delta^{d-1}$ with $\pi_i > 0$ for all $i \in [d]$ [25,
Corollary 1.17]. We also assume that $\boldsymbol{P}$ is reversible (with respect to $\boldsymbol{\pi}$):

$$\pi_i P_{i,j} = \pi_j P_{j,i}, \quad i, j \in [d]. \tag{3}$$

The minimum stationary probability is denoted by $\pi_\star := \min_{i \in [d]} \pi_i$.

Define the matrices

$$\boldsymbol{M} := \operatorname{Diag}(\boldsymbol{\pi}) \boldsymbol{P} \quad \text{and} \quad \boldsymbol{L} := \operatorname{Diag}(\boldsymbol{\pi})^{-1/2} \boldsymbol{M} \operatorname{Diag}(\boldsymbol{\pi})^{-1/2}.$$

The $(i,j)$th entry of the matrix $\boldsymbol{M}$ contains the doublet probabilities associated with $\boldsymbol{P}$: $M_{i,j} = \pi_i P_{i,j}$ is the
probability of seeing state $i$ followed by state $j$ when the chain is started from its stationary distribution. The
matrix $\boldsymbol{M}$ is symmetric on account of the reversibility of $\boldsymbol{P}$, and hence it follows that $\boldsymbol{L}$ is also symmetric.
(We will strongly exploit the symmetry in our results.) Further, $\boldsymbol{L} = \operatorname{Diag}(\boldsymbol{\pi})^{1/2} \boldsymbol{P} \operatorname{Diag}(\boldsymbol{\pi})^{-1/2}$, hence $\boldsymbol{L}$
and $\boldsymbol{P}$ are similar and thus their eigenvalue systems are identical. Ergodicity and reversibility imply that
the eigenvalues of $\boldsymbol{L}$ are contained in the interval $(-1, 1]$, and that 1 is an eigenvalue of $\boldsymbol{L}$ with multiplicity
1 [25, Lemmas 12.1 and 12.2]. Denote and order the eigenvalues of $\boldsymbol{L}$ as

$$1 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_d > -1.$$

Let $\lambda_\star := \max\{\lambda_2, |\lambda_d|\}$, and define the (absolute) spectral gap to be $\gamma_\star := 1 - \lambda_\star$, which is strictly positive
on account of ergodicity.
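As a concrete illustration of these definitions, the following sketch computes $\gamma_\star$ and $\pi_\star$ for a known reversible chain through the symmetric matrix $\boldsymbol{L}$, and then evaluates the two sides of Eq. (2). The function names and the two-state example are hypothetical; the code assumes the pair $(\boldsymbol{P}, \boldsymbol{\pi})$ is given exactly, which is not the case in the estimation problem studied here.

```python
import numpy as np

def spectral_quantities(P, pi):
    """Return (gamma_star, pi_star) for a reversible ergodic chain (P, pi)."""
    s = np.sqrt(pi)
    # L = Diag(pi)^{1/2} P Diag(pi)^{-1/2}; it is symmetric when P is reversible.
    L = (s[:, None] * P) / s[None, :]
    if not np.allclose(L, L.T):
        raise ValueError("P is not reversible with respect to pi")
    eig = np.sort(np.linalg.eigvalsh(L))[::-1]      # lambda_1 >= ... >= lambda_d
    lam_star = max(eig[1], abs(eig[-1]))
    return 1.0 - lam_star, float(pi.min())

def mixing_time_bounds(gamma_star, pi_star):
    """Lower and upper bounds on t_mix from Eq. (2)."""
    t_relax = 1.0 / gamma_star
    return (t_relax - 1.0) * np.log(2.0), t_relax * np.log(4.0 / pi_star)

# Hypothetical two-state example with stationary distribution (2/3, 1/3).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])
gamma_star, pi_star = spectral_quantities(P, pi)
lower, upper = mixing_time_bounds(gamma_star, pi_star)
```

Here $\gamma_\star = 0.3$, so $t_{\mathrm{relax}} = 10/3$ and the bounds are roughly 1.6 and 8.3.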

Let $(X_t)_{t \in \mathbb{N}}$ be a Markov chain whose transition probabilities are governed by $\boldsymbol{P}$. For each $t \in \mathbb{N}$, let
$\boldsymbol{\pi}^{(t)} \in \Delta^{d-1}$ denote the marginal distribution of $X_t$, so

$$\boldsymbol{\pi}^{(t+1)} = \boldsymbol{\pi}^{(t)} \boldsymbol{P}, \quad t \in \mathbb{N}.$$

Note that the initial distribution $\boldsymbol{\pi}^{(1)}$ is arbitrary, and need not be the stationary distribution $\boldsymbol{\pi}$.

The goal is to estimate $\pi_\star$ and $\gamma_\star$ from the length $n$ sample path $(X_t)_{t \in [n]}$, and also to construct fully
empirical confidence intervals that trap $\pi_\star$ and $\gamma_\star$ with high probability; in particular, the construction of the
intervals should not depend on any unobservable quantities, including $\pi_\star$ and $\gamma_\star$ themselves. As mentioned
in the introduction, it is well-known that the mixing time of the Markov chain $t_{\mathrm{mix}}$ (defined in Eq. (1)) is
bounded in terms of $\pi_\star$ and $\gamma_\star$, as shown in Eq. (2). Moreover, convergence rates for empirical processes on
Markov chain sequences are also often given in terms of mixing coefficients that can ultimately be bounded
in terms of $\pi_\star$ and $\gamma_\star$ (as we will show in the proof of our first result). Therefore, valid confidence intervals
for $\pi_\star$ and $\gamma_\star$ can be used to make these rates fully observable.

3 Point estimation 

In this section, we present lower and upper bounds on achievable rates for estimating the spectral gap as a 
function of the length of the sample path n. 

3.1 Lower bounds 

The purpose of this section is to show lower bounds on the number of observations necessary to achieve
a fixed multiplicative (or even just additive) accuracy in estimating the spectral gap $\gamma_\star$. By Eq. (2), the
multiplicative accuracy lower bound for $\gamma_\star$ gives the same lower bound for estimating the mixing time. Our
first result holds even for two-state Markov chains and shows that a sequence length of $\Omega(1/\pi_\star)$ is necessary
to achieve even a constant additive accuracy in estimating $\gamma_\star$.

Theorem 1. Pick any $\bar{\pi} \in (0, 1/4)$. Consider any estimator $\hat{\gamma}_\star$ that takes as input a random sample
path of length $n \le 1/(4\bar{\pi})$ from a Markov chain starting from any desired initial state distribution. There
exists a two-state ergodic and reversible Markov chain distribution with spectral gap $\gamma_\star \ge 1/2$ and minimum
stationary probability $\pi_\star \ge \bar{\pi}$ such that

$$\Pr\bigl[\,|\hat{\gamma}_\star - \gamma_\star| \ge 1/8\,\bigr] \ge 3/8.$$

Next, considering $d$ state chains, we show that a sequence of length $\Omega(d \log(d)/\gamma_\star)$ is required to estimate
$\gamma_\star$ up to a constant multiplicative accuracy. Essentially, the sequence may have to visit all $d$ states at least
$\log(d)/\gamma_\star$ times each, on average. This holds even if $\pi_\star$ is within a factor of two of the largest possible value
of $1/d$ that it can take, i.e., when $\boldsymbol{\pi}$ is nearly uniform.

Theorem 2. There is an absolute constant $c > 0$ such that the following holds. Pick any positive integer
$d \ge 3$ and any $\bar{\gamma} \in (0, 1/2)$. Consider any estimator $\hat{\gamma}_\star$ that takes as input a random sample path of length
$n \le c\, d \log(d)/\bar{\gamma}$ from a $d$-state reversible Markov chain starting from any desired initial state distribution.
There is an ergodic and reversible Markov chain distribution with spectral gap $\gamma_\star \in [\bar{\gamma}, 2\bar{\gamma}]$ and minimum
stationary probability $\pi_\star \ge 1/(2d)$ such that

$$\Pr\bigl[\,|\hat{\gamma}_\star - \gamma_\star| \ge \bar{\gamma}/2\,\bigr] \ge 1/4.$$

The proofs of Theorems 1 and 2 are given in Appendix A. 

3.2 A plug-in based point estimator and its accuracy 

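Let us now consider the problem of estimating $\gamma_\star$. For this, we construct a natural plug-in estimator. Along
the way, we also provide an estimator for the minimum stationary probability, allowing one to use the bounds
from Eq. (2) to trap the mixing time.

Define the random matrix $\hat{\boldsymbol{M}} \in [0,1]^{d \times d}$ and random vector $\hat{\boldsymbol{\pi}} \in \Delta^{d-1}$ by

$$\hat{M}_{i,j} := \frac{|\{t \in [n-1] : (X_t, X_{t+1}) = (i,j)\}|}{n-1}, \quad i, j \in [d],$$

$$\hat{\pi}_i := \frac{|\{t \in [n] : X_t = i\}|}{n}, \quad i \in [d].$$

Furthermore, define

$$\operatorname{Sym}(\hat{\boldsymbol{L}}) := \frac{1}{2}\bigl(\hat{\boldsymbol{L}} + \hat{\boldsymbol{L}}^{\top}\bigr)$$

to be the symmetrized version of the (possibly non-symmetric) matrix

$$\hat{\boldsymbol{L}} := \operatorname{Diag}(\hat{\boldsymbol{\pi}})^{-1/2} \hat{\boldsymbol{M}} \operatorname{Diag}(\hat{\boldsymbol{\pi}})^{-1/2}.$$

Let $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_d$ be the eigenvalues of $\operatorname{Sym}(\hat{\boldsymbol{L}})$. Our estimator of the minimum stationary probability
$\pi_\star$ is

$$\hat{\pi}_\star := \min_{i \in [d]} \hat{\pi}_i,$$

and our estimator of the spectral gap $\gamma_\star$ is

$$\hat{\gamma}_\star := 1 - \max\{\hat{\lambda}_2, |\hat{\lambda}_d|\}.$$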
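The plug-in construction above translates almost line by line into code. The following is a minimal sketch, assuming states are encoded as integers $0, \dots, d-1$ (rather than $1, \dots, d$) and that every state is visited at least once, so that $\operatorname{Diag}(\hat{\boldsymbol{\pi}})^{-1/2}$ is defined. The helper `simulate_chain` and the example chain are hypothetical and serve only to produce a sample path.

```python
import numpy as np

def simulate_chain(P, n, rng, start=0):
    """Draw a length-n sample path (0-indexed states) from transition matrix P."""
    d = P.shape[0]
    cum = np.cumsum(P, axis=1)
    u = rng.random(n)
    path = np.empty(n, dtype=np.int64)
    path[0] = start
    for t in range(1, n):
        nxt = np.searchsorted(cum[path[t - 1]], u[t], side="right")
        path[t] = min(nxt, d - 1)       # guard against round-off in the cumulative sums
    return path

def plug_in_estimates(path, d):
    """Return (pi_star_hat, gamma_star_hat) computed from one sample path."""
    path = np.asarray(path)
    n = len(path)
    pi_hat = np.bincount(path, minlength=d) / n
    if pi_hat.min() == 0.0:
        raise ValueError("some state was never visited; the estimate is undefined")
    # Doublet frequencies: M_hat[i, j] = fraction of consecutive pairs equal to (i, j).
    M_hat = np.zeros((d, d))
    np.add.at(M_hat, (path[:-1], path[1:]), 1.0)
    M_hat /= n - 1
    scale = 1.0 / np.sqrt(pi_hat)
    L_hat = scale[:, None] * M_hat * scale[None, :]
    eig = np.sort(np.linalg.eigvalsh(0.5 * (L_hat + L_hat.T)))[::-1]
    gamma_hat = 1.0 - max(eig[1], abs(eig[-1]))
    return float(pi_hat.min()), float(gamma_hat)

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])               # true gamma_star = 0.3, pi_star = 1/3
path = simulate_chain(P, 20_000, rng)
pi_star_hat, gamma_star_hat = plug_in_estimates(path, d=2)
```

The symmetrization step matters because $\hat{\boldsymbol{M}}$ computed from a single path is generally not symmetric, even when the underlying chain is reversible.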


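These estimators have the following accuracy guarantees:

Theorem 3. There exists an absolute constant $C > 0$ such that the following holds. Assume the estimators
$\hat{\pi}_\star$ and $\hat{\gamma}_\star$ described above are formed from a sample path of length $n$ from an ergodic and reversible Markov
chain. Let $\gamma_\star > 0$ denote the spectral gap and $\pi_\star > 0$ the minimum stationary probability. For any $\delta \in (0,1)$,
with probability at least $1 - \delta$,

$$|\hat{\pi}_\star - \pi_\star| \le C \left( \sqrt{\frac{\pi_\star \log\frac{d}{\pi_\star \delta}}{\gamma_\star n}} + \frac{\log\frac{d}{\pi_\star \delta}}{\gamma_\star n} \right) \tag{4}$$

and

$$|\hat{\gamma}_\star - \gamma_\star| \le C \sqrt{\frac{\log\frac{d}{\delta} \cdot \log\frac{n}{\pi_\star \delta}}{\pi_\star \gamma_\star n}}. \tag{5}$$

Theorem 3 implies that the sequence lengths required to estimate $\pi_\star$ and $\gamma_\star$ to within constant
multiplicative factors are, respectively,

$$\tilde{O}\left(\frac{1}{\pi_\star \gamma_\star}\right) \quad \text{and} \quad \tilde{O}\left(\frac{1}{\pi_\star \gamma_\star^3}\right).$$

By Eq. (2), the second of these is also a bound on the required sequence length to estimate $t_{\mathrm{mix}}$.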

The proof of Theorem 3 is based on analyzing the convergence of the sample averages $\hat{\boldsymbol{M}}$ and $\hat{\boldsymbol{\pi}}$ to
their expectation, and then using perturbation bounds for eigenvalues to derive a bound on the error of
$\hat{\gamma}_\star$. However, since these averages are formed using a single sample path from a (possibly) non-stationary
Markov chain, we cannot use standard large deviation bounds; moreover, applying Chernoff-type bounds for
Markov chains to each entry of $\hat{\boldsymbol{M}}$ would result in a significantly worse sequence length requirement, roughly
a factor of $d$ larger. Instead, we adapt probability tail bounds for sums of independent random matrices [44]
to our non-iid setting by directly applying a blocking technique of [7] as described in the article of [45]. Due
to ergodicity, the convergence rate can be bounded without any dependence on the initial state distribution.
The proof of Theorem 3 is given in Appendix B.

Note that because the eigenvalues of $\boldsymbol{L}$ are the same as those of the transition probability matrix $\boldsymbol{P}$, we
could have instead opted to estimate $\boldsymbol{P}$, say, using simple frequency estimates obtained from the sample path,
and then computing the second largest eigenvalue of this empirical estimate $\hat{\boldsymbol{P}}$. In fact, this approach is a
way to extend to non-reversible chains, as we would no longer rely on the symmetry of $\boldsymbol{M}$ or $\boldsymbol{L}$. The difficulty
with this approach is that $\hat{\boldsymbol{P}}$ lacks the structure required by certain strong eigenvalue perturbation results.
One could instead invoke the Ostrowski-Elsner theorem [cf. Theorem 1.4 on Page 170 of 41], which bounds
the matching distance between the eigenvalues of a matrix $\boldsymbol{A}$ and its perturbation $\boldsymbol{A} + \boldsymbol{E}$ by $O(\|\boldsymbol{E}\|^{1/d})$.
Since $\|\hat{\boldsymbol{P}} - \boldsymbol{P}\|$ is expected to be of size $O(n^{-1/2})$, this approach will give a confidence interval for $\gamma_\star$ whose
width shrinks at a rate of $O(n^{-1/(2d)})$, an exponential slow-down compared to the rate from Theorem 3. As
demonstrated through an example from [41], the dependence on the $d$-th root of the norm of the perturbation
cannot be avoided in general. Our approach based on estimating a symmetric matrix affords us the use of
perturbation results that exploit more structure.
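The $d$-th root dependence can be seen numerically on a textbook matrix. The sketch below uses the perturbed nilpotent Jordan block, a standard illustration that is not necessarily the example given in [41]: all eigenvalues of the unperturbed block are zero, and a single corner entry of size $\varepsilon$ moves every eigenvalue to modulus $\varepsilon^{1/d}$. The function name and the parameter values are chosen here purely for illustration.

```python
import numpy as np

def perturbed_jordan_block(d, eps):
    """d x d nilpotent Jordan block with one corner entry set to eps."""
    A = np.diag(np.ones(d - 1), k=1)    # ones on the superdiagonal; all eigenvalues are 0
    A[-1, 0] = eps                      # perturbation E with spectral norm eps
    return A

d, eps = 4, 1e-8
moduli = np.abs(np.linalg.eigvals(perturbed_jordan_block(d, eps)))
predicted = eps ** (1.0 / d)            # d-th root of the perturbation size
print(moduli, predicted)
```

With $d = 4$ and $\varepsilon = 10^{-8}$, a perturbation of norm $10^{-8}$ moves the eigenvalues by $10^{-2}$, whereas for symmetric matrices Weyl's inequality limits the movement to the norm of the perturbation itself.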

Returning to the question of obtaining a fully empirical confidence interval for $\gamma_\star$ and $\pi_\star$, we notice that,
unfortunately, Theorem 3 falls short of being directly suitable for this, at least without further assumptions.
This is because the deviation terms themselves depend inversely both on $\gamma_\star$ and $\pi_\star$, and hence can never
rule out 0 (or an arbitrarily small positive value) as a possibility for $\gamma_\star$ or $\pi_\star$.² In effect, the fact that the
Markov chain could be slow mixing and the long-term frequency of some states could be small makes it
difficult to be confident in the estimates provided by $\hat{\gamma}_\star$ and $\hat{\pi}_\star$. This suggests that in order to obtain fully
empirical confidence intervals, we need an estimator that is not subject to such effects; we pursue this in
Section 4. Theorem 3 thus primarily serves as a point of comparison for what is achievable in terms of
estimation accuracy when one does not need to provide empirical confidence bounds.


4 Fully empirical confidence intervals 

In this section, we address the shortcoming of Theorem 3 and give fully empirical confidence intervals
for the stationary probabilities and the spectral gap $\gamma_\star$. The main idea is to use the Markov property
to eliminate the dependence of the confidence intervals on the unknown quantities (including $\pi_\star$ and $\gamma_\star$).
Specifically, we estimate the transition probabilities from the sample path using simple frequency estimates:
as a consequence of the Markov property, for each state, the frequency estimates converge at a rate that
depends only on the number of visits to the state, and in particular the rate (given the visit count of the
state) is independent of the mixing time of the chain.

As discussed in Section 3, it is possible to form a confidence interval for $\gamma_\star$ based on the eigenvalues
of an estimated transition probability matrix by appealing to the Ostrowski-Elsner theorem. However, as
explained earlier, this would lead to a slow rate. We avoid this slow rate by using an estimate
of the symmetric matrix $\boldsymbol{L}$, so that we can use a stronger perturbation result (namely Weyl’s inequality, as
in the proof of Theorem 3) available for symmetric matrices.

To form an estimate of $\boldsymbol{L}$ based on an estimate of the transition probabilities, one possibility is to
estimate $\boldsymbol{\pi}$ using a frequency-based estimate as was done in Section 3, and appeal to the relation
$\boldsymbol{L} = \operatorname{Diag}(\boldsymbol{\pi})^{1/2} \boldsymbol{P} \operatorname{Diag}(\boldsymbol{\pi})^{-1/2}$ to form a plug-in estimate. However, as noted in Section 3.2, confidence
intervals for the entries of $\boldsymbol{\pi}$ formed this way may depend on the mixing time. Indeed, such an estimate of
$\boldsymbol{\pi}$ does not exploit the Markov property.

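We adopt a different strategy for estimating $\boldsymbol{\pi}$, which leads to our construction of empirical confidence
intervals, detailed in Algorithm 1. We form the matrix $\hat{\boldsymbol{P}}$ using smoothed frequency estimates of $\boldsymbol{P}$ (Step 1),
then compute the so-called group inverse $\hat{\boldsymbol{A}}^{\#}$ of $\hat{\boldsymbol{A}} = \boldsymbol{I} - \hat{\boldsymbol{P}}$ (Step 2), followed by finding the unique stationary
distribution $\hat{\boldsymbol{\pi}}$ of $\hat{\boldsymbol{P}}$ (Step 3), this way decoupling the bound on the accuracy of $\hat{\boldsymbol{\pi}}$ from the mixing time. The
group inverse $\hat{\boldsymbol{A}}^{\#}$ of $\hat{\boldsymbol{A}}$ is uniquely defined; and if $\hat{\boldsymbol{P}}$ defines an ergodic chain (which is the case here due to
the use of the smoothed estimates), $\hat{\boldsymbol{A}}^{\#}$ can be computed at the cost of inverting a $(d-1) \times (d-1)$ matrix [31,
Theorem 5.2].³ Further, once given $\hat{\boldsymbol{A}}^{\#}$, the unique stationary distribution $\hat{\boldsymbol{\pi}}$ of $\hat{\boldsymbol{P}}$ can be read out from
the last row of $\hat{\boldsymbol{A}}^{\#}$ [31, Theorem 5.3]. The group inverse is also used to compute the sensitivity of $\hat{\boldsymbol{\pi}}$.
Based on $\hat{\boldsymbol{\pi}}$ and $\hat{\boldsymbol{P}}$, we construct the plug-in estimate $\hat{\boldsymbol{L}}$ of $\boldsymbol{L}$, and use the eigenvalues of its symmetrization
to form the estimate $\hat{\gamma}_\star$ of the spectral gap (Steps 4 and 5). In the remaining steps, we use perturbation
analyses to relate $\hat{\boldsymbol{\pi}}$ and $\boldsymbol{\pi}$, viewing $\boldsymbol{P}$ as the perturbation of $\hat{\boldsymbol{P}}$; and also to relate $\hat{\gamma}_\star$ and $\gamma_\star$, viewing $\boldsymbol{L}$ as
a perturbation of $\operatorname{Sym}(\hat{\boldsymbol{L}})$. Both analyses give error bounds entirely in terms of observable quantities (e.g.,
$\hat{\kappa}$), tracing back to empirical error bounds for the smoothed frequency estimates of $\boldsymbol{P}$.

² Using Theorem 3, it is possible to trap $\gamma_\star$ in the union of two empirical confidence intervals, one around $\hat{\gamma}_\star$ and the other
around zero, both of which shrink in width as the sequence length increases.

³ The group inverse of a square matrix $\boldsymbol{A}$, a special case of the Drazin inverse, is the unique matrix $\boldsymbol{A}^{\#}$ satisfying
$\boldsymbol{A}\boldsymbol{A}^{\#}\boldsymbol{A} = \boldsymbol{A}$, $\boldsymbol{A}^{\#}\boldsymbol{A}\boldsymbol{A}^{\#} = \boldsymbol{A}^{\#}$ and $\boldsymbol{A}^{\#}\boldsymbol{A} = \boldsymbol{A}\boldsymbol{A}^{\#}$.

Algorithm 1 Empirical confidence intervals

Input: Sample path $(X_1, X_2, \dots, X_n)$, confidence parameter $\delta \in (0,1)$.

1: Compute state visit counts and smoothed transition probability estimates:

$$N_i := |\{t \in [n-1] : X_t = i\}|, \quad i \in [d];$$

$$N_{i,j} := |\{t \in [n-1] : (X_t, X_{t+1}) = (i,j)\}|, \quad (i,j) \in [d]^2;$$

$$\hat{P}_{i,j} := \frac{N_{i,j} + 1/d}{N_i + 1}, \quad (i,j) \in [d]^2.$$

2: Let $\hat{\boldsymbol{A}}^{\#}$ be the group inverse of $\hat{\boldsymbol{A}} := \boldsymbol{I} - \hat{\boldsymbol{P}}$.

3: Let $\hat{\boldsymbol{\pi}} \in \Delta^{d-1}$ be the unique stationary distribution for $\hat{\boldsymbol{P}}$.

4: Compute eigenvalues $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_d$ of $\operatorname{Sym}(\hat{\boldsymbol{L}})$, where $\hat{\boldsymbol{L}} := \operatorname{Diag}(\hat{\boldsymbol{\pi}})^{1/2} \hat{\boldsymbol{P}} \operatorname{Diag}(\hat{\boldsymbol{\pi}})^{-1/2}$.

5: Spectral gap estimate:

$$\hat{\gamma}_\star := 1 - \max\{\hat{\lambda}_2, |\hat{\lambda}_d|\}.$$

6: Empirical bounds for $|\hat{P}_{i,j} - P_{i,j}|$ for $(i,j) \in [d]^2$: $c := 1.1$, $\tau_{n,\delta} := \inf\{t \ge 0 : 2d^2 (1 + \lceil \log_c(2n/t) \rceil_+) e^{-t} \le \delta\}$, and

$$\hat{B}_{i,j} := \left( \sqrt{\frac{c\,\tau_{n,\delta}}{2N_i}} + \sqrt{\frac{c\,\tau_{n,\delta}}{2N_i} + \sqrt{\frac{2c\,\hat{P}_{i,j}(1 - \hat{P}_{i,j})\,\tau_{n,\delta}}{N_i}} + \frac{(4/3)\,\tau_{n,\delta} + |\hat{P}_{i,j} - 1/d|}{N_i}} \right)^{2}.$$

7: Relative sensitivity of $\hat{\boldsymbol{\pi}}$:

$$\hat{\kappa} := \frac{1}{2} \max\Bigl\{ \hat{A}^{\#}_{j,j} - \min\bigl\{ \hat{A}^{\#}_{i,j} : i \in [d] \bigr\} : j \in [d] \Bigr\}.$$

8: Empirical bounds for $\max_{i \in [d]} |\hat{\pi}_i - \pi_i|$ and $\max \bigcup_{i \in [d]} \bigl\{ |\sqrt{\pi_i/\hat{\pi}_i} - 1|, |\sqrt{\hat{\pi}_i/\pi_i} - 1| \bigr\}$:

$$\hat{b} := \hat{\kappa} \max\bigl\{ \hat{B}_{i,j} : (i,j) \in [d]^2 \bigr\}, \qquad \hat{\rho} := \frac{1}{2} \max \bigcup_{i \in [d]} \left\{ \frac{\hat{b}}{\hat{\pi}_i}, \frac{\hat{b}}{[\hat{\pi}_i - \hat{b}]_+} \right\}.$$

9: Empirical bounds for $|\hat{\gamma}_\star - \gamma_\star|$:

$$\hat{w} := 2\hat{\rho} + \hat{\rho}^2 + (1 + 2\hat{\rho} + \hat{\rho}^2) \left( \sum_{(i,j) \in [d]^2} \frac{\hat{\pi}_i}{\hat{\pi}_j} \hat{B}_{i,j}^2 \right)^{1/2}.$$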
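The point-estimation part of Algorithm 1 (Steps 1 to 5 and Step 7) can be sketched as follows. Several choices here are assumptions rather than the procedure of the paper: states are 0-indexed; the smoothing in Step 1 is taken to be add-$1/d$ smoothing, $(N_{i,j} + 1/d)/(N_i + 1)$; the stationary distribution is obtained by solving a linear system instead of reading it from the group inverse; and the group inverse is computed through the fundamental-matrix identity $\boldsymbol{A}^{\#} = (\boldsymbol{A} + \boldsymbol{1}\boldsymbol{\pi})^{-1} - \boldsymbol{1}\boldsymbol{\pi}$, which needs a $d \times d$ inversion rather than the $(d-1) \times (d-1)$ inversion of [31]. The confidence widths of Steps 6, 8 and 9 are not implemented.

```python
import numpy as np

def smoothed_transition_estimate(path, d):
    """Step 1: visit counts N_i and add-(1/d) smoothed transition estimates."""
    path = np.asarray(path)
    N = np.zeros((d, d))
    np.add.at(N, (path[:-1], path[1:]), 1.0)       # N[i, j] = number of i -> j transitions
    N_i = N.sum(axis=1)                             # visits to i among the first n-1 steps
    P_hat = (N + 1.0 / d) / (N_i[:, None] + 1.0)    # assumed smoothing; rows sum to one
    return P_hat, N_i

def stationary_and_group_inverse(P):
    """Steps 2-3: stationary distribution of P and group inverse of A = I - P."""
    d = P.shape[0]
    A = np.eye(d) - P
    lhs = np.vstack([A.T, np.ones((1, d))])         # pi A = 0 together with sum(pi) = 1
    rhs = np.zeros(d + 1)
    rhs[-1] = 1.0
    pi = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
    Pi = np.tile(pi, (d, 1))                        # every row equals pi
    A_sharp = np.linalg.inv(A + Pi) - Pi            # fundamental-matrix identity
    return pi, A_sharp

def gap_and_sensitivity(P_hat):
    """Steps 4, 5 and 7: spectral gap estimate and relative sensitivity."""
    pi_hat, A_sharp = stationary_and_group_inverse(P_hat)
    s = np.sqrt(pi_hat)
    L_hat = (s[:, None] * P_hat) / s[None, :]
    eig = np.sort(np.linalg.eigvalsh(0.5 * (L_hat + L_hat.T)))[::-1]
    gamma_hat = 1.0 - max(eig[1], abs(eig[-1]))
    # For each column j: diagonal entry minus the column minimum, then half the maximum.
    kappa_hat = 0.5 * np.max(np.diag(A_sharp) - A_sharp.min(axis=0))
    return pi_hat, A_sharp, gamma_hat, kappa_hat

# Hypothetical short sample path over two states.
path = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
P_hat, N_i = smoothed_transition_estimate(path, d=2)
pi_hat, A_sharp, gamma_hat, kappa_hat = gap_and_sensitivity(P_hat)
```

Because every entry of the smoothed estimate is strictly positive, the estimated chain is ergodic for any sample path, so the stationary distribution and the group inverse are always well defined.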

The most computationally expensive step in Algorithm 1 is the computation of the group inverse $\hat{\boldsymbol{A}}^{\#}$,
which, as noted, reduces to matrix inversion. Thus, with a standard implementation of matrix inversion, the
algorithm’s time complexity is $O(n + d^3)$, while its space complexity is $O(d^2)$.

To state our main theorem concerning Algorithm 1, we first define $\kappa$ to be analogous to $\hat{\kappa}$ from Step 7,
with $\hat{\boldsymbol{A}}^{\#}$ replaced by the group inverse $\boldsymbol{A}^{\#}$ of $\boldsymbol{A} := \boldsymbol{I} - \boldsymbol{P}$. The result is as follows.


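Theorem 4. Suppose Algorithm 1 is given as input a sample path of length $n$ from an ergodic and reversible
Markov chain and confidence parameter $\delta \in (0,1)$. Let $\gamma_\star > 0$ denote the spectral gap, $\boldsymbol{\pi}$ the unique stationary
distribution, and $\pi_\star > 0$ the minimum stationary probability. Then, on an event of probability at least $1 - \delta$,

$$\pi_i \in [\hat{\pi}_i - \hat{b}, \hat{\pi}_i + \hat{b}] \ \text{ for all } i \in [d], \quad \text{and} \quad \gamma_\star \in [\hat{\gamma}_\star - \hat{w}, \hat{\gamma}_\star + \hat{w}].$$

Moreover, $\hat{b}$ and $\hat{w}$ almost surely satisfy (as $n \to \infty$)

$$\hat{b} = O\left( \max_{(i,j) \in [d]^2} \kappa \sqrt{\frac{P_{i,j} \log\log n}{\pi_i n}} \right), \qquad \hat{w} = O\left( \frac{\kappa}{\pi_\star} \sqrt{\frac{\log\log n}{\pi_\star n}} + \sqrt{\frac{d \log\log n}{\pi_\star n}} \right).⁴$$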

The proof of Theorem 4 is given in Appendix C. As mentioned above, the obstacle encountered in
Theorem 3 is avoided by exploiting the Markov property. We establish fully observable upper and lower
bounds on the entries of $\boldsymbol{P}$ that converge at a $\sqrt{n/\log\log n}$ rate using standard martingale tail inequalities;
this justifies the validity of the bounds from Step 6. Properties of the group inverse [10, 31] and eigenvalue
perturbation theory [41] are used to validate the empirical bounds on $\pi_i$ and $\gamma_\star$ developed in the remaining
steps of the algorithm.

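The first part of Theorem 4 provides valid empirical confidence intervals for each $\pi_i$ and for $\gamma_\star$, which
are simultaneously valid at confidence level $\delta$. The second part of Theorem 4 shows that the widths of the
intervals decrease as the sequence length increases. We show in Appendix C.5 that $\kappa \le d/\gamma_\star$, and hence

$$\hat{b} = O\left( \max_{(i,j) \in [d]^2} \frac{d}{\gamma_\star} \sqrt{\frac{P_{i,j} \log\log n}{\pi_i n}} \right), \qquad \hat{w} = O\left( \frac{d}{\pi_\star \gamma_\star} \sqrt{\frac{\log\log n}{\pi_\star n}} \right).$$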


It is easy to combine Theorems 3 and 4 to yield intervals whose widths shrink at least as fast as both the
non-empirical intervals from Theorem 3 and the empirical intervals from Theorem 4. Specifically, determine
lower bounds on $\pi_\star$ and $\gamma_\star$ using Algorithm 1,

$$\pi_\star \ge \min_{i \in [d]} [\hat{\pi}_i - \hat{b}]_+, \qquad \gamma_\star \ge [\hat{\gamma}_\star - \hat{w}]_+;$$

then plug these lower bounds for $\pi_\star$ and $\gamma_\star$ into the deviation bounds in Eq. (5) from Theorem 3. This yields
a new interval centered around the estimate of $\gamma_\star$ from Theorem 3, and it no longer depends on unknown
quantities. The interval is a valid $1 - 2\delta$ probability confidence interval for $\gamma_\star$, and for sufficiently large $n$, the
width shrinks at the rate given in Eq. (5). We can similarly construct an empirical confidence interval for $\pi_\star$
using Eq. (4), which is valid on the same $1 - 2\delta$ probability event.⁵ Finally, we can take the intersection of
these new intervals with the corresponding intervals from Algorithm 1. This is summarized in the following
theorem, which we prove in Appendix D.

⁴ In Theorems 4 and 5, our use of big-O notation is as follows. For a random sequence $(Y_n)_{n \ge 1}$ and a (non-random) positive
sequence $(\varepsilon_{\theta,n})_{n \ge 1}$ parameterized by $\theta$, we say “$Y_n = O(\varepsilon_{\theta,n})$ holds almost surely as $n \to \infty$” if there is some universal constant
$C > 0$ such that for all $\theta$, $\limsup_{n \to \infty} Y_n/\varepsilon_{\theta,n} \le C$ holds almost surely.

⁵ For the $\pi_\star$ interval, we plug in lower bounds on $\pi_\star$ and $\gamma_\star$ only where these quantities appear as $1/\pi_\star$ and $1/\gamma_\star$ in
Eq. (4). It is then possible to “solve” for observable bounds on $\pi_\star$. See Appendix D for details.















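Theorem 5. The following holds under the same conditions as Theorem 4. For any $\delta \in (0,1)$, the confidence
intervals $\hat{U}$ and $\hat{V}$ described above for $\pi_\star$ and $\gamma_\star$, respectively, satisfy $\pi_\star \in \hat{U}$ and $\gamma_\star \in \hat{V}$ with probability at
least $1 - 2\delta$. Furthermore, the widths of these intervals almost surely satisfy (as $n \to \infty$)

$$|\hat{U}| = O\left( \sqrt{\frac{\pi_\star \log\frac{d}{\pi_\star \delta}}{\gamma_\star n}} \right), \qquad |\hat{V}| = O\left( \min\left\{ \sqrt{\frac{\log\frac{d}{\delta} \cdot \log(n)}{\pi_\star \gamma_\star n}},\ \hat{w} \right\} \right),$$

where $\hat{w}$ is the width from Algorithm 1.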
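The combination step behind Theorem 5 can be sketched in a few lines. Since the absolute constant $C$ of Theorem 3 is not specified, the sketch takes the deviation bound as a user-supplied function of the two lower bounds; this callable, the function name and all numerical values below are hypothetical and only illustrate the plug-in and intersection logic for the $\gamma_\star$ interval.

```python
import math

def combined_gap_interval(gamma_hat_plugin, deviation, gamma_hat_alg, w_hat, pi_hat, b_hat):
    """Intersect the Algorithm 1 interval with a plug-in version of a deviation bound.

    deviation(pi_lower, gamma_lower) should return the half-width obtained by
    substituting the lower bounds into a bound of the type of Eq. (5).
    """
    pi_lower = max(0.0, min(pi_hat) - b_hat)        # lower bound on pi_star
    gamma_lower = max(0.0, gamma_hat_alg - w_hat)   # lower bound on gamma_star
    alg_interval = (gamma_hat_alg - w_hat, gamma_hat_alg + w_hat)
    if pi_lower == 0.0 or gamma_lower == 0.0:
        # The plug-in bound is vacuous; fall back on the Algorithm 1 interval alone.
        return alg_interval
    width = deviation(pi_lower, gamma_lower)
    lo = max(alg_interval[0], gamma_hat_plugin - width)
    hi = min(alg_interval[1], gamma_hat_plugin + width)
    return lo, hi

# Hypothetical inputs; the constant 0.5 stands in for the unspecified constant
# and logarithmic factors of Eq. (5).
n = 100_000
toy_deviation = lambda p, g: 0.5 / math.sqrt(p * g * n)
lo, hi = combined_gap_interval(0.31, toy_deviation, 0.29, 0.10, [0.65, 0.35], 0.05)
vacuous = combined_gap_interval(0.31, toy_deviation, 0.29, 0.50, [0.65, 0.35], 0.05)
```

In the first call the plug-in interval is much narrower than the Algorithm 1 interval, so the intersection is governed by the former; in the second call the lower bound on $\gamma_\star$ is zero and only the Algorithm 1 interval is returned.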


5 Discussion 

The construction used in Theorem 5 applies more generally: Given a confidence interval of the form
$I_n = I_n(\gamma_\star, \pi_\star, \delta)$ for some confidence level $\delta$ and a fully empirical confidence set $E_n(\delta)$ for $(\gamma_\star, \pi_\star)$ for
the same level, $I'_n = E_n(\delta) \cap \bigcup_{(\gamma, \pi) \in E_n(\delta)} I_n(\gamma, \pi, \delta)$ is a valid fully empirical $2\delta$-level confidence interval
whose asymptotic width matches that of $I_n$ up to lower order terms under reasonable assumptions on $E_n$
and $I_n$. In particular, this suggests that future work should focus on closing the gap between the lower and
upper bounds on the accuracy of point-estimation. Another interesting direction is to reduce the computation
cost: The current cubic cost in the number of states can be too high even when the number of states is
only moderately large.

Perhaps more important, however, is to extend our results to large state space Markov chains: In most 
practical applications the state space is continuous or is exponentially large in some natural parameters. As 
follows from our lower bounds, without further assumptions, the problem of fully data dependent estimation 
of the mixing time is intractable for information theoretical reasons. Interesting directions for future work 
thus must consider Markov chains with specific structure. Parametric classes of Markov chains, including but 
not limited to Markov chains with factored transition kernels with a few factors, are a promising candidate 
for such future investigations. The results presented here are a first step in the ambitious research agenda 
outlined above, and we hope that they will serve as a point of departure for further insights in the area of 
fully empirical estimation of Markov chain parameters based on a single sample path. 
A Proofs of the lower bounds

Theorem 6 (Theorem 1 restated). Pick any $\bar{\pi} \in (0, 1/4)$. Consider any estimator $\hat{\gamma}_\star$ that takes as input
a random sample path of length $n \le 1/(4\bar{\pi})$ from a Markov chain starting from any desired initial state
distribution. There exists a two-state ergodic and reversible Markov chain distribution with spectral gap
$\gamma_\star \ge 1/2$ and minimum stationary probability $\pi_\star \ge \bar{\pi}$ such that

$$\Pr\bigl[\,|\hat{\gamma}_\star - \gamma_\star| \ge 1/8\,\bigr] \ge 3/8.$$

Proof. Fix $\bar{\pi} \in (0, 1/4)$. Consider two Markov chains given by the following stochastic matrices:

$$\boldsymbol{P}^{(1)} := \begin{bmatrix} 1 - \bar{\pi} & \bar{\pi} \\ 1 - \bar{\pi} & \bar{\pi} \end{bmatrix}, \qquad \boldsymbol{P}^{(2)} := \begin{bmatrix} 1 - \bar{\pi} & \bar{\pi} \\ 1/2 & 1/2 \end{bmatrix}.$$

Each Markov chain is ergodic and reversible; their stationary distributions are, respectively, $\boldsymbol{\pi}^{(1)} = (1 - \bar{\pi}, \bar{\pi})$
and $\boldsymbol{\pi}^{(2)} = (1/(1 + 2\bar{\pi}), 2\bar{\pi}/(1 + 2\bar{\pi}))$. We have $\pi_\star \ge \bar{\pi}$ in both cases. For the first Markov chain, $\lambda_\star = 0$,
and hence the spectral gap is 1; for the second Markov chain, $\lambda_\star = 1/2 - \bar{\pi}$, so the spectral gap is $1/2 + \bar{\pi}$.

In order to guarantee $|\hat{\gamma}_\star - \gamma_\star| < 1/8 < |1 - (1/2 + \bar{\pi})|/2$, it must be possible to distinguish the two
Markov chains. Assume that the initial state distribution has mass at least 1/2 on state 1. (If this is not
the case, we swap the roles of states 1 and 2 in the constructions above.) With probability at least half,
the initial state is 1; and both chains have the same transition probabilities from state 1. The chains are
indistinguishable unless the sample path eventually reaches state 2. But with probability at least 3/4, a
sample path of length $n \le 1/(4\bar{\pi})$ starting from state 1 always remains in the same state (this follows from
properties of the geometric distribution and the assumption $\bar{\pi} < 1/4$). □
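The quantities used in this proof are easy to check numerically for a particular value of $\bar{\pi}$. The sketch below builds the two chains, computes their spectral gaps from the fact that a two-state stochastic matrix has eigenvalues 1 and $\operatorname{trace} - 1$, and evaluates the probability that a path started in state 1 never leaves it. The choice $\bar{\pi} = 0.1$ and the function names are illustrative assumptions.

```python
import numpy as np

def two_state_pair(pi_bar):
    """The two chains from the proof of Theorem 1 (state 1 is row 0 here)."""
    P1 = np.array([[1 - pi_bar, pi_bar],
                   [1 - pi_bar, pi_bar]])
    P2 = np.array([[1 - pi_bar, pi_bar],
                   [0.5, 0.5]])
    return P1, P2

def two_state_gap(P):
    """Spectral gap of a two-state chain: eigenvalues are 1 and trace(P) - 1."""
    return 1.0 - abs(np.trace(P) - 1.0)

pi_bar = 0.1
P1, P2 = two_state_pair(pi_bar)
gap1, gap2 = two_state_gap(P1), two_state_gap(P2)

# Any estimator within 1/8 of both gaps would distinguish the two chains.
half_separation = abs(gap1 - gap2) / 2.0

# Probability that a path of length n started in state 1 makes n - 1 self-transitions.
n = int(1.0 / (4.0 * pi_bar))
stay_probability = (1.0 - pi_bar) ** (n - 1)
```

For $\bar{\pi} = 0.1$ the gaps are 1 and 0.6, the half separation is 0.2, and a path of length $n = 2$ stays in state 1 with probability 0.9.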

Theorem 7 (Theorem 2 restated). There is an absolute constant $c > 0$ such that the following holds. Pick
any positive integer $d \ge 3$ and any $\bar{\gamma} \in (0, 1/2)$. Consider any estimator $\hat{\gamma}_\star$ that takes as input a random
sample path of length $n \le c\, d \log(d)/\bar{\gamma}$ from a $d$-state reversible Markov chain starting from any desired initial
state distribution. There is an ergodic and reversible Markov chain distribution with spectral gap $\gamma_\star \in [\bar{\gamma}, 2\bar{\gamma}]$
and minimum stationary probability $\pi_\star \ge 1/(2d)$ such that

$$\Pr\bigl[\,|\hat{\gamma}_\star - \gamma_\star| \ge \bar{\gamma}/2\,\bigr] \ge 1/4.$$

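Proof. We consider $d$-state Markov chains of the following form:

$$P_{i,j} = \begin{cases} 1 - \varepsilon_i & \text{if } i = j; \\ \dfrac{\varepsilon_i}{d-1} & \text{if } i \ne j \end{cases}$$

for some $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_d \in (0,1)$. Such a chain is ergodic and reversible, and its unique stationary distribution
$\pi$ satisfies

$$\pi_i = \frac{1/\varepsilon_i}{\sum_{j=1}^d 1/\varepsilon_j}.$$

We fix $\varepsilon := \frac{d-1}{d/2}\,\bar\gamma$ and set $\varepsilon' := \frac{d/2-1}{d-1}\,\varepsilon < \varepsilon$. Consider the following $d+1$ different Markov chains of the type
described above:

• $P^{(0)}$: $\varepsilon_1 = \cdots = \varepsilon_d = \varepsilon$. For this Markov chain, $\lambda_2 = \lambda_d = \lambda_\star = 1 - \frac{d}{d-1}\varepsilon$.

• $P^{(i)}$ for $i \in [d]$: $\varepsilon_j = \varepsilon$ for $j \ne i$, and $\varepsilon_i = \varepsilon'$. For these Markov chains, $\lambda_2 = 1 - \varepsilon' - \frac{1}{d-1}\varepsilon = 1 - \frac{d/2}{d-1}\varepsilon$,
and $\lambda_d = 1 - \frac{d}{d-1}\varepsilon$. So $\lambda_\star = 1 - \frac{d/2}{d-1}\varepsilon$.

The spectral gap in each chain satisfies $\gamma_\star \in [\bar\gamma, 2\bar\gamma]$; in $P^{(i)}$ for $i \in [d]$, it is half of what it is in $P^{(0)}$. Also
$\pi_i \ge 1/(2d)$ for each $i \in [d]$.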

In order to guarantee $|\hat\gamma_\star - \gamma_\star| < \bar\gamma/2$, it must be possible to distinguish $P^{(0)}$ from each $P^{(i)}$, $i \in [d]$.
But $P^{(0)}$ is identical to $P^{(i)}$ except for the transition probabilities from state $i$. Therefore, regardless of the
initial state, the sample path must visit all states in order to distinguish $P^{(0)}$ from each $P^{(i)}$, $i \in [d]$. For any
of the $d+1$ Markov chains above, the earliest time in which a sample path visits all $d$ states stochastically
dominates a generalized coupon collection time $T = 1 + \sum_{i=1}^{d-1} T_i$, where $T_i$ is the number of steps required to
see the $(i+1)$-th distinct state in the sample path beyond the first $i$. The random variables $T_1, T_2, \dots, T_{d-1}$
are independent, and are geometrically distributed, $T_i \sim \mathrm{Geom}(\varepsilon - (i-1)\varepsilon/(d-1))$. We have that


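$$E[T_i] = \frac{d-1}{\varepsilon(d-i)}, \qquad \operatorname{var}(T_i) = \frac{1 - \varepsilon\frac{d-i}{d-1}}{\left(\varepsilon\frac{d-i}{d-1}\right)^2}.$$

Therefore

$$E[T] = 1 + \frac{d-1}{\varepsilon} H_{d-1}, \qquad \operatorname{var}(T) \le \left(\frac{d-1}{\varepsilon}\right)^2 \frac{\pi^2}{6},$$

where $H_{d-1} = 1 + 1/2 + 1/3 + \cdots + 1/(d-1)$. By the Paley-Zygmund inequality,

$$\Pr\left[T > \tfrac13 E[T]\right] \ge \frac{1}{1 + \frac{\operatorname{var}(T)}{(1-1/3)^2 E[T]^2}} \ge \frac{1}{1 + \frac{\left(\frac{d-1}{\varepsilon}\right)^2 \frac{\pi^2}{6}}{(4/9)\left(\frac{d-1}{\varepsilon} H_{d-1}\right)^2}} \ge \frac14.$$

Since $n < c\,d\log(d)/\bar\gamma < (1/3)(1 + (d-1)H_{d-1}/(2\bar\gamma)) \le E[T]/3$ (for an appropriate absolute constant $c$),
with probability at least 1/4, the sample path does not visit all $d$ states. □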
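The construction in this proof can be sanity-checked on a small instance. The sketch below is illustrative only: the specific values of `eps` and `eps_prime` are assumptions, chosen so that the spectral gap is $2\bar\gamma$ in the uniform chain and $\bar\gamma$ in the perturbed chain, and the helper names are hypothetical.

```python
def build_chain(eps):
    """Transition matrix with P[i][i] = 1 - eps[i] and P[i][j] = eps[i]/(d-1) otherwise."""
    d = len(eps)
    return [[1 - eps[i] if i == j else eps[i] / (d - 1) for j in range(d)]
            for i in range(d)]


def stationary(eps):
    """Stationary distribution, proportional to 1/eps[i]."""
    weights = [1.0 / e for e in eps]
    total = sum(weights)
    return [w / total for w in weights]


def expected_collection_time(d, eps):
    """E[T] = 1 + (d-1)/eps * H_{d-1} for the generalized coupon collector time."""
    harmonic = sum(1.0 / k for k in range(1, d))
    return 1 + (d - 1) / eps * harmonic


d, gamma_bar = 10, 0.1
eps = 2 * (d - 1) * gamma_bar / d            # assumed value: gap 2*gamma_bar in P^(0)
eps_prime = (d - 2) / (2 * (d - 1)) * eps    # assumed value: gap gamma_bar in P^(i)
eps_list = [eps_prime] + [eps] * (d - 1)     # the chain P^(1), perturbed at state 1
P = build_chain(eps_list)
pi = stationary(eps_list)

print("minimum stationary probability:", min(pi), "vs 1/(2d) =", 1 / (2 * d))
print("E[T] for the coupon collector bound:", expected_collection_time(d, eps))
```

A path much shorter than $E[T]/3$ is unlikely to have visited every state, so it cannot tell which state (if any) carries the perturbed parameter.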

We claim in Section 1 that a sample path of length $\Omega((d\log d)/\gamma_\star + 1/\pi_\star)$ is required to guarantee
constant multiplicative accuracy in estimating $\gamma_\star$. This follows by combining Theorems 1 and 2 in a standard,
straightforward way. Specifically, if the length of the sample path $n$ is smaller than $(n_1 + n_2)/2$, where $n_1$
is the lower bound from Theorem 1 and $n_2$ is the lower bound from Theorem 2, then $n$ is smaller than
$\max\{n_1, n_2\}$. So at least one of Theorem 1 and Theorem 2 implies the existence of an ergodic and reversible
Markov chain distribution with spectral gap $\gamma_\star$ and stationary distribution minorized by $\pi_\star$ such that

$$\Pr\left[\,|\hat\gamma_\star - \gamma_\star| \ge \gamma_\star/4\,\right] \ge 1/4.$$


B Proof of Theorem 3 

In this section, we prove Theorem 3. 


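B.1 Accuracy of $\hat\pi_\star$

We start by proving the deviation bound on $\hat\pi_\star - \pi_\star$, from which we may easily deduce Eq. (4) in Theorem 3.

Lemma 1. Pick any $\delta \in (0,1)$, and let

$$\varepsilon_n := \frac{\ln\left(\frac{d}{\delta}\sqrt{\frac{2}{\pi_\star}}\right)}{\gamma_\star n}. \tag{6}$$

With probability at least $1-\delta$, the following inequalities hold simultaneously:

$$|\hat\pi_i - \pi_i| \le \sqrt{8\pi_i(1-\pi_i)\varepsilon_n} + 20\varepsilon_n \quad \text{for all } i \in [d]; \tag{7}$$

$$|\hat\pi_\star - \pi_\star| \le 4\sqrt{\pi_\star\varepsilon_n} + 47\varepsilon_n. \tag{8}$$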

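Proof. We use the following Bernstein-type inequality for Markov chains from [37, Theorem 3.8]: letting $P^\pi$
denote the probability with respect to the stationary chain (where the marginal distribution of each $X_t$ is
$\pi$), we have for every $\varepsilon > 0$,

$$P^\pi\left(|\hat\pi_i - \pi_i| > \varepsilon\right) \le 2\exp\left(-\frac{n\gamma_\star\varepsilon^2}{4\pi_i(1-\pi_i) + 10\varepsilon}\right).$$

To handle possibly non-stationary chains, as is our case, we combine the above inequality with [37, Proposition 3.14],
to obtain for any $\varepsilon > 0$,

$$P\left(|\hat\pi_i - \pi_i| > \varepsilon\right) \le \sqrt{\frac{1}{\pi_\star}\,P^\pi\left(|\hat\pi_i - \pi_i| > \varepsilon\right)} \le \sqrt{\frac{2}{\pi_\star}}\exp\left(-\frac{n\gamma_\star\varepsilon^2}{8\pi_i(1-\pi_i) + 20\varepsilon}\right).$$

Using this tail inequality with $\varepsilon := \sqrt{8\pi_i(1-\pi_i)\varepsilon_n} + 20\varepsilon_n$ and a union bound over all $i \in [d]$ implies that
the inequalities in Eq. (7) hold with probability at least $1-\delta$.

Now assume this $1-\delta$ probability event holds; it remains to prove that Eq. (8) also holds in this event.
Without loss of generality, we assume that $\pi_\star = \pi_1 \le \pi_2 \le \cdots \le \pi_d$. Let $j \in [d]$ be such that $\hat\pi_\star = \hat\pi_j$. By
Eq. (7), we have $|\pi_i - \hat\pi_i| \le \sqrt{8\pi_i\varepsilon_n} + 20\varepsilon_n$ for each $i \in \{1, j\}$. Since $\hat\pi_\star \le \hat\pi_1$,

$$\hat\pi_\star - \pi_\star \le \hat\pi_1 - \pi_1 \le \sqrt{8\pi_\star\varepsilon_n} + 20\varepsilon_n \le \pi_\star + 22\varepsilon_n,$$

where the last inequality follows by the AM/GM inequality. Furthermore, using the fact that $a \le b\sqrt{a} + c \Rightarrow a \le b^2 + b\sqrt{c} + c$
for nonnegative numbers $a, b, c \ge 0$ [see, e.g., 9] with the inequality $\pi_j \le \sqrt{8\pi_j\varepsilon_n} + (\hat\pi_j + 20\varepsilon_n)$ gives

$$\pi_j \le \hat\pi_j + \sqrt{8(\hat\pi_j + 20\varepsilon_n)\varepsilon_n} + 28\varepsilon_n.$$

Therefore

$$\pi_\star - \hat\pi_\star \le \pi_j - \hat\pi_j \le \sqrt{8(\hat\pi_\star + 20\varepsilon_n)\varepsilon_n} + 28\varepsilon_n \le \sqrt{8(2\pi_\star + 42\varepsilon_n)\varepsilon_n} + 28\varepsilon_n \le 4\sqrt{\pi_\star\varepsilon_n} + 47\varepsilon_n,$$

where the second-to-last inequality follows from the above bound on $\hat\pi_\star - \pi_\star$, and the last inequality uses
$\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ for nonnegative $a, b \ge 0$. □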


B.2 Accuracy of $\hat\gamma_\star$

Let us now turn to proving Eq. (5), i.e., the bound on the error of the spectral gap estimate $\hat\gamma_\star$. The accuracy
of $\hat\gamma_\star$ is based on the accuracy of $\mathrm{Sym}(\hat L)$ in approximating $L$ via Weyl's inequality:

$$|\hat\lambda_i - \lambda_i| \le \|\mathrm{Sym}(\hat L) - L\| \quad \text{for all } i \in [d].$$

Moreover, the triangle inequality implies that symmetrizing $\hat L$ can only help:

$$\|\mathrm{Sym}(\hat L) - L\| \le \|\hat L - L\|.$$

Therefore, we can deduce Eq. (5) in Theorem 3 from the following lemma.

Lemma 2. There exists an absolute constant $C > 0$ such that the following holds. For any $\delta \in (0,1)$, with
probability at least $1-\delta$, the bounds from Lemma 1 hold, and


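$$\|\hat L - L\| \le C\left(\sqrt{\frac{\log\left(\frac{d}{\delta}\right)\log\left(\frac{n}{\pi_\star\delta}\right)}{\pi_\star\gamma_\star n}} + \frac{\log\left(\frac{1}{\pi_\star\gamma_\star\delta}\right)}{\gamma_\star n}\right).$$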


The remainder of this section is devoted to proving this lemma. 

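The error $\hat L - L$ may be written as

$$\hat L - L = \mathcal{E}_M + \mathcal{E}_\pi L + L\mathcal{E}_\pi + \mathcal{E}_\pi\mathcal{E}_M + \mathcal{E}_M\mathcal{E}_\pi + \mathcal{E}_\pi L\mathcal{E}_\pi + \mathcal{E}_\pi\mathcal{E}_M\mathcal{E}_\pi,$$

where

$$\mathcal{E}_\pi := \mathrm{Diag}(\hat\pi)^{-1/2}\,\mathrm{Diag}(\pi)^{1/2} - I, \qquad \mathcal{E}_M := \mathrm{Diag}(\pi)^{-1/2}\,(\hat M - M)\,\mathrm{Diag}(\pi)^{-1/2}.$$

Therefore

$$\|\hat L - L\| \le \|\mathcal{E}_M\| + \left(\|\mathcal{E}_M\| + \|L\|\right)\left(2\|\mathcal{E}_\pi\| + \|\mathcal{E}_\pi\|^2\right).$$

Since $\|L\| \le 1$ and $\|\hat L\| \le 1$ [25, Lemma 12.1], we also have $\|\hat L - L\| \le 2$. Therefore,

$$\|\hat L - L\| \le \min\left\{\|\mathcal{E}_M\| + \left(\|\mathcal{E}_M\| + \|L\|\right)\left(2\|\mathcal{E}_\pi\| + \|\mathcal{E}_\pi\|^2\right),\; 2\right\} \le 3\left(\|\mathcal{E}_M\| + \|\mathcal{E}_\pi\|\right). \tag{9}$$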
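The final simplification in Eq. (9) is elementary but easy to get wrong, so a brute-force check over a grid of norm values may be reassuring. This is only a numerical illustration, assuming $\|L\| \le 1$ is replaced by its worst case 1; the function names are hypothetical.

```python
def decomposition_bound(err_m, err_pi, norm_l=1.0):
    """Bound from the error decomposition, capped at the trivial bound 2."""
    return min(err_m + (err_m + norm_l) * (2 * err_pi + err_pi ** 2), 2.0)


def simplified_bound(err_m, err_pi):
    """The simpler bound 3 * (||E_M|| + ||E_pi||) on the right of Eq. (9)."""
    return 3 * (err_m + err_pi)


# Grid of candidate values 0.00, 0.02, ..., 3.00 for both norms.
grid = [k / 50 for k in range(0, 151)]
worst_slack = min(simplified_bound(m, p) - decomposition_bound(m, p)
                  for m in grid for p in grid)
print("smallest slack on the grid:", worst_slack)
```

The slack is never negative on the grid: when the two norms sum to at least 2/3 the cap at 2 does the work, and otherwise the quadratic terms are dominated by the linear ones.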



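B.3 A bound on $\|\mathcal{E}_\pi\|$

Since $\mathcal{E}_\pi$ is diagonal,

$$\|\mathcal{E}_\pi\| = \max_{i\in[d]}\left|\sqrt{\frac{\pi_i}{\hat\pi_i}} - 1\right|.$$

Assume that

$$n \ge \frac{108\ln\left(\frac{d}{\delta}\sqrt{\frac{2}{\pi_\star}}\right)}{\pi_\star\gamma_\star}, \tag{10}$$

in which case

$$\sqrt{8\pi_i(1-\pi_i)\varepsilon_n} + 20\varepsilon_n \le \frac{\pi_i}{2},$$

where $\varepsilon_n$ is as defined in Eq. (6). Therefore, in the $1-\delta$ probability event from Lemma 1, we have
$|\pi_i - \hat\pi_i| \le \pi_i/2$ for each $i \in [d]$, and moreover, $2/3 \le \pi_i/\hat\pi_i \le 2$ for each $i \in [d]$. For this range of $\pi_i/\hat\pi_i$, we
have

$$\left|\sqrt{\frac{\pi_i}{\hat\pi_i}} - 1\right| \le \left|\frac{\hat\pi_i}{\pi_i} - 1\right|.$$

We conclude that if $n$ satisfies Eq. (10), then in this $1-\delta$ probability event from Lemma 1,

$$\|\mathcal{E}_\pi\| \le \max_{i\in[d]}\left|\frac{\hat\pi_i}{\pi_i} - 1\right| \le \max_{i\in[d]}\frac{\sqrt{8\pi_i(1-\pi_i)\varepsilon_n} + 20\varepsilon_n}{\pi_i} \le \sqrt{\frac{8\varepsilon_n}{\pi_\star}} + \frac{20\varepsilon_n}{\pi_\star} = \sqrt{\frac{8\ln\left(\frac{d}{\delta}\sqrt{\frac{2}{\pi_\star}}\right)}{\pi_\star\gamma_\star n}} + \frac{20\ln\left(\frac{d}{\delta}\sqrt{\frac{2}{\pi_\star}}\right)}{\pi_\star\gamma_\star n}. \tag{11}$$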


B.4 Accuracy of doublet frequency estimates (bounding $\|\mathcal{E}_M\|$)

In this section we prove a bound on $\|\mathcal{E}_M\|$. For this, we decompose $\mathcal{E}_M = \mathrm{Diag}(\pi)^{-1/2}(\hat M - M)\,\mathrm{Diag}(\pi)^{-1/2}$
into $E(\mathcal{E}_M)$ and $\mathcal{E}_M - E(\mathcal{E}_M)$, the first measuring the effect of a non-stationary start of the chain, and
the second measuring the variation due to randomness.


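B.4.1 Bounding $\|E(\mathcal{E}_M)\|$: The price of a non-stationary start.

Let $\pi^{(t)}$ be the distribution of states at time step $t$. We will make use of the following proposition, which
can be derived by following [36, Proposition 1.12]:

Proposition 1. For $t \ge 1$, let $\Upsilon^{(t)}$ be the vector with $\Upsilon^{(t)}_i = \pi^{(t)}_i/\pi_i$ and let $\|\cdot\|_{2,\pi}$ denote the $\pi$-weighted
2-norm

$$\|v\|_{2,\pi} := \left(\sum_{i=1}^d \pi_i v_i^2\right)^{1/2}. \tag{12}$$

Then,

$$\|\Upsilon^{(t)} - \mathbf{1}\|_{2,\pi} \le \frac{(1-\gamma_\star)^{t-1}}{\sqrt{\pi_\star}}. \tag{13}$$

An immediate corollary of this result is that

$$\left\|\mathrm{Diag}(\pi^{(t)})\,\mathrm{Diag}(\pi)^{-1} - I\right\| \le \frac{(1-\gamma_\star)^{t-1}}{\pi_\star}. \tag{14}$$

Now note that

$$E(\hat M) = \frac{1}{n-1}\sum_{t=1}^{n-1}\mathrm{Diag}(\pi^{(t)})\,P$$

and thus

$$\begin{aligned}
E(\mathcal{E}_M) &= \mathrm{Diag}(\pi)^{-1/2}\left(E(\hat M) - M\right)\mathrm{Diag}(\pi)^{-1/2} \\
&= \frac{1}{n-1}\sum_{t=1}^{n-1}\mathrm{Diag}(\pi)^{-1/2}\left(\mathrm{Diag}(\pi^{(t)}) - \mathrm{Diag}(\pi)\right)P\,\mathrm{Diag}(\pi)^{-1/2} \\
&= \frac{1}{n-1}\sum_{t=1}^{n-1}\mathrm{Diag}(\pi)^{-1/2}\left(\mathrm{Diag}(\pi^{(t)})\,\mathrm{Diag}(\pi)^{-1} - I\right)M\,\mathrm{Diag}(\pi)^{-1/2} \\
&= \frac{1}{n-1}\sum_{t=1}^{n-1}\left(\mathrm{Diag}(\pi^{(t)})\,\mathrm{Diag}(\pi)^{-1} - I\right)L.
\end{aligned}$$

Combining this, $\|L\| \le 1$ and Eq. (14), we get

$$\|E(\mathcal{E}_M)\| \le \frac{1}{(n-1)\pi_\star}\sum_{t=1}^{n-1}(1-\gamma_\star)^{t-1} \le \frac{1}{(n-1)\gamma_\star\pi_\star}. \tag{15}$$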

B.4.2 Bounding $\|\mathcal{E}_M - E(\mathcal{E}_M)\|$: Application of a matrix tail inequality

In this section we analyze the deviations of $\mathcal{E}_M - E(\mathcal{E}_M)$. By the definition of $\mathcal{E}_M$,

$$\|\mathcal{E}_M - E(\mathcal{E}_M)\| = \left\|\mathrm{Diag}(\pi)^{-1/2}\left(\hat M - E\hat M\right)\mathrm{Diag}(\pi)^{-1/2}\right\|. \tag{16}$$

The matrix $\hat M - E\hat M$ is defined as a sum of dependent centered random matrices. We will use the blocking
technique of [7] to relate the likely deviations of this matrix to that of a sum of independent centered random
matrices. The deviations of these will then be bounded with the help of a Bernstein-type matrix tail inequality
due to [44].

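We divide $[n-1]$ into contiguous blocks of time steps; each has size $a \le n/3$ except possibly the first
block, which has size between $a$ and $2a-1$. Formally, let $a' := a + ((n-1) \bmod a) \le 2a-1$, and define

$$\begin{aligned}
F &:= [a'], \\
H_s &:= \{t \in [n-1] : a' + 2(s-1)a + 1 \le t \le a' + (2s-1)a\}, \\
T_s &:= \{t \in [n-1] : a' + (2s-1)a + 1 \le t \le a' + 2sa\},
\end{aligned}$$

for $s = 1, 2, \dots$. Let $\mu_H$ (resp., $\mu_T$) be the number of non-empty $H_s$ (resp., $T_s$) blocks. Let $n_H := a\mu_H$
(resp., $n_T := a\mu_T$) be the number of time steps in $\cup_s H_s$ (resp., $\cup_s T_s$). We have

$$\hat M = \frac{1}{n-1}\sum_{t=1}^{n-1} e_{X_t}e_{X_{t+1}}^\top = \frac{a'}{n-1}\underbrace{\frac{1}{a'}\sum_{t\in F} e_{X_t}e_{X_{t+1}}^\top}_{\hat M_F} + \frac{n_H}{n-1}\underbrace{\frac{1}{n_H}\sum_{t\in\cup_s H_s} e_{X_t}e_{X_{t+1}}^\top}_{\hat M_H} + \frac{n_T}{n-1}\underbrace{\frac{1}{n_T}\sum_{t\in\cup_s T_s} e_{X_t}e_{X_{t+1}}^\top}_{\hat M_T}. \tag{17}$$

Here, $e_i$ is the $i$-th coordinate basis vector, so $e_ie_j^\top \in \{0,1\}^{d\times d}$ is a $d \times d$ matrix of all zeros except for a 1
in the $(i,j)$-th position.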
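To make the block structure concrete, here is a small sketch that enumerates the first block and the alternating $H$ and $T$ blocks for given $n$ and $a$. The function name `make_blocks` and the list-of-lists representation are illustrative assumptions, not part of the proof.

```python
def make_blocks(n, a):
    """Partition the time steps 1..n-1 into F and alternating H_s / T_s blocks."""
    if not 1 <= a <= n - 1:
        raise ValueError("block length a must satisfy 1 <= a <= n - 1")
    a_first = a + ((n - 1) % a)  # a' in the text; lies between a and 2a - 1
    F = list(range(1, a_first + 1))
    H, T = [], []
    s = 1
    while True:
        # H_s and T_s as defined in the text, intersected with [n - 1].
        h_block = [t for t in range(a_first + 2 * (s - 1) * a + 1,
                                    a_first + (2 * s - 1) * a + 1) if t <= n - 1]
        t_block = [t for t in range(a_first + (2 * s - 1) * a + 1,
                                    a_first + 2 * s * a + 1) if t <= n - 1]
        if not h_block:
            break
        H.append(h_block)
        if t_block:
            T.append(t_block)
        s += 1
    return F, H, T


F, H, T = make_blocks(n=20, a=3)
print("F =", F)
print("H blocks =", H)
print("T blocks =", T)
```

Because $n - 1 - a'$ is a multiple of $a$, every $H_s$ and $T_s$ block has exactly $a$ elements, and consecutive $H$ blocks are separated by a full $T$ block, which is what later allows them to be treated as nearly independent.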

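The contribution of the first block is easily bounded using the triangle inequality:

$$\frac{a'}{n-1}\left\|\mathrm{Diag}(\pi)^{-1/2}\left(\hat M_F - E(\hat M_F)\right)\mathrm{Diag}(\pi)^{-1/2}\right\| \le \frac{1}{n-1}\sum_{t\in F}\left(\left\|\frac{e_{X_t}e_{X_{t+1}}^\top}{\sqrt{\pi_{X_t}\pi_{X_{t+1}}}}\right\| + \left\|E\left(\frac{e_{X_t}e_{X_{t+1}}^\top}{\sqrt{\pi_{X_t}\pi_{X_{t+1}}}}\right)\right\|\right) \le \frac{2a'}{\pi_\star(n-1)}. \tag{18}$$

It remains to bound the contributions of the $H_s$ blocks and the $T_s$ blocks. We just focus on the $H_s$
blocks, since the analysis is identical for the $T_s$ blocks.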

Let

$$Y_s := \frac{1}{a}\sum_{t\in H_s} e_{X_t}e_{X_{t+1}}^\top, \qquad s \in [\mu_H],$$

so

$$\hat M_H = \frac{1}{\mu_H}\sum_{s=1}^{\mu_H} Y_s,$$

an average of the random matrices $Y_s$. For each $s \in [\mu_H]$, the random matrix $Y_s$ is a function of

$$(X_t : a' + 2(s-1)a + 1 \le t \le a' + (2s-1)a + 1)$$

(note the $+1$ in the upper limit of $t$), so $Y_{s+1}$ is $a$ time steps ahead of $Y_s$. When $a$ is sufficiently large, we
will be able to effectively treat the random matrices $Y_s$ as if they were independent. In the sequel, we shall
always assume that the block length $a$ satisfies

$$a \ge a_\delta := \frac{1}{\gamma_\star}\ln\frac{2(n-2)}{\delta\pi_\star} \tag{19}$$

for $\delta \in (0,1)$.
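Define

$$\pi^{(H_s)} := \frac{1}{a}\sum_{t\in H_s}\pi^{(t)}, \qquad \pi^{(H)} := \frac{1}{\mu_H}\sum_{s=1}^{\mu_H}\pi^{(H_s)}.$$

Observe that

$$E(Y_s) = \mathrm{Diag}(\pi^{(H_s)})\,P$$

so

$$E\left(\frac{1}{\mu_H}\sum_{s=1}^{\mu_H} Y_s\right) = \mathrm{Diag}(\pi^{(H)})\,P.$$

Define

$$Z_s := \mathrm{Diag}(\pi)^{-1/2}\left(Y_s - E(Y_s)\right)\mathrm{Diag}(\pi)^{-1/2}.$$

We apply a matrix tail inequality to the average of independent copies of the $Z_s$'s. More precisely, we will
apply the tail inequality to independent copies $\tilde Z_s$, $s \in [\mu_H]$, of the random variables $Z_s$ and then relate the
average of $\tilde Z_s$ to that of $Z_s$. The following probability inequality is from [44, Theorem 6.1.1].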

Theorem 8 (Matrix Bernstein inequality). Let $Q_1, Q_2, \dots, Q_m$ be a sequence of independent, random $d_1 \times d_2$
matrices. Assume that $E(Q_i) = 0$ and $\|Q_i\| \le R$ for each $1 \le i \le m$. Let $S = \sum_{i=1}^m Q_i$ and let

$$v = \max\left\{\left\|E\sum_{i} Q_iQ_i^\top\right\|,\; \left\|E\sum_{i} Q_i^\top Q_i\right\|\right\}.$$

Then, for all $t \ge 0$,

$$P(\|S\| \ge t) \le 2(d_1 + d_2)\exp\left(-\frac{t^2/2}{v + Rt/3}\right).$$

In other words, for any $\delta \in (0,1)$,

$$P\left(\|S\| > \sqrt{2v\ln\frac{2(d_1+d_2)}{\delta}} + \frac{2R}{3}\ln\frac{2(d_1+d_2)}{\delta}\right) \le \delta.$$

To apply Theorem 8, it suffices to bound the spectral norms of $Z_s$ (almost surely), $E(Z_sZ_s^\top)$, and
$E(Z_s^\top Z_s)$.
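The passage from the tail bound to its "in other words" form can be illustrated with a few lines of code. This sketch assumes the exponent takes the standard Bernstein form $-(t^2/2)/(v + Rt/3)$ and keeps the prefactor $2(d_1 + d_2)$ used in this appendix; the function names are hypothetical.

```python
import math


def bernstein_tail(t, v, R, d1, d2):
    """Tail bound 2(d1 + d2) * exp(-(t^2 / 2) / (v + R t / 3))."""
    return 2 * (d1 + d2) * math.exp(-(t * t / 2) / (v + R * t / 3))


def bernstein_threshold(delta, v, R, d1, d2):
    """Deviation level sqrt(2 v log(.)) + (2R/3) log(.) at which the tail is at most delta."""
    log_term = math.log(2 * (d1 + d2) / delta)
    return math.sqrt(2 * v * log_term) + (2 * R / 3) * log_term


# Illustrative (made-up) values for the variance proxy v and the range bound R.
v, R, d, delta = 50.0, 12.0, 5, 0.05
t = bernstein_threshold(delta, v, R, d, d)
print("threshold:", t)
print("tail bound at threshold:", bernstein_tail(t, v, R, d, d))
```

Evaluating the tail at the threshold returns a value no larger than $\delta$, which is the inversion used later for the averages of the block matrices.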

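Range bound. By the triangle inequality,

$$\|Z_s\| \le \left\|\mathrm{Diag}(\pi)^{-1/2}\,Y_s\,\mathrm{Diag}(\pi)^{-1/2}\right\| + \left\|\mathrm{Diag}(\pi)^{-1/2}\,E(Y_s)\,\mathrm{Diag}(\pi)^{-1/2}\right\|.$$

For the first term, we have

$$\left\|\mathrm{Diag}(\pi)^{-1/2}\,Y_s\,\mathrm{Diag}(\pi)^{-1/2}\right\| \le \frac{1}{\pi_\star}. \tag{20}$$

For the second term, we use the fact $\|L\| \le 1$ to bound

$$\left\|\mathrm{Diag}(\pi)^{-1/2}\left(E(Y_s) - M\right)\mathrm{Diag}(\pi)^{-1/2}\right\| = \left\|\left(\mathrm{Diag}(\pi^{(H_s)})\,\mathrm{Diag}(\pi)^{-1} - I\right)L\right\| \le \left\|\mathrm{Diag}(\pi^{(H_s)})\,\mathrm{Diag}(\pi)^{-1} - I\right\|.$$

Then, using Eq. (14),

$$\left\|\mathrm{Diag}(\pi^{(H_s)})\,\mathrm{Diag}(\pi)^{-1} - I\right\| \le \frac{(1-\gamma_\star)^{a' + 2(s-1)a}}{\pi_\star} \le \frac{(1-\gamma_\star)^{a}}{\pi_\star} \le 1, \tag{21}$$

where the last inequality follows from the assumption that the block length $a$ satisfies Eq. (19). Combining
this with $\|\mathrm{Diag}(\pi)^{-1/2}\,M\,\mathrm{Diag}(\pi)^{-1/2}\| = \|L\| \le 1$, it follows that

$$\left\|\mathrm{Diag}(\pi)^{-1/2}\,E(Y_s)\,\mathrm{Diag}(\pi)^{-1/2}\right\| \le 2$$

by the triangle inequality. Therefore, together with Eq. (20), we obtain the range bound

$$\|Z_s\| \le \frac{1}{\pi_\star} + 2. \tag{22}$$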

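Variance bound. We now determine bounds on the spectral norms of $E(Z_sZ_s^\top)$ and $E(Z_s^\top Z_s)$. Observe
that

$$\begin{aligned}
E(Z_sZ_s^\top) &= \frac{1}{a^2}\sum_{t\in H_s} E\left(\mathrm{Diag}(\pi)^{-1/2} e_{X_t}e_{X_{t+1}}^\top\,\mathrm{Diag}(\pi)^{-1}\,e_{X_{t+1}}e_{X_t}^\top\,\mathrm{Diag}(\pi)^{-1/2}\right) && (23) \\
&\quad + \frac{1}{a^2}\sum_{\substack{t, t' \in H_s \\ t \ne t'}} E\left(\mathrm{Diag}(\pi)^{-1/2} e_{X_t}e_{X_{t+1}}^\top\,\mathrm{Diag}(\pi)^{-1}\,e_{X_{t'+1}}e_{X_{t'}}^\top\,\mathrm{Diag}(\pi)^{-1/2}\right) && (24) \\
&\quad - \mathrm{Diag}(\pi)^{-1/2}\,E(Y_s)\,\mathrm{Diag}(\pi)^{-1}\,E(Y_s^\top)\,\mathrm{Diag}(\pi)^{-1/2}. && (25)
\end{aligned}$$

The first sum, Eq. (23), easily simplifies to the diagonal matrix

$$\frac{1}{a^2}\sum_{t\in H_s}\sum_{i=1}^d\sum_{j=1}^d \Pr(X_t = i, X_{t+1} = j)\cdot\frac{1}{\pi_i\pi_j}\,e_ie_i^\top = \frac{1}{a^2}\sum_{t\in H_s}\sum_{i=1}^d\sum_{j=1}^d \pi^{(t)}_iP_{i,j}\cdot\frac{1}{\pi_i\pi_j}\,e_ie_i^\top = \frac{1}{a}\sum_{i=1}^d\frac{\pi^{(H_s)}_i}{\pi_i}\left(\sum_{j=1}^d\frac{P_{i,j}}{\pi_j}\right)e_ie_i^\top.$$

For the second sum, Eq. (24), a symmetric matrix, consider

$$u^\top\left(\frac{1}{a^2}\sum_{\substack{t, t' \in H_s \\ t \ne t'}} E\left(\mathrm{Diag}(\pi)^{-1/2} e_{X_t}e_{X_{t+1}}^\top\,\mathrm{Diag}(\pi)^{-1}\,e_{X_{t'+1}}e_{X_{t'}}^\top\,\mathrm{Diag}(\pi)^{-1/2}\right)\right)u$$

for an arbitrary unit vector $u$. By Cauchy-Schwarz and AM/GM, this is bounded from above by

$$\frac{1}{2a^2}\sum_{\substack{t, t' \in H_s \\ t \ne t'}}\left[E\left(u^\top\mathrm{Diag}(\pi)^{-1/2} e_{X_t}e_{X_{t+1}}^\top\,\mathrm{Diag}(\pi)^{-1}\,e_{X_{t+1}}e_{X_t}^\top\,\mathrm{Diag}(\pi)^{-1/2}u\right) + E\left(u^\top\mathrm{Diag}(\pi)^{-1/2} e_{X_{t'}}e_{X_{t'+1}}^\top\,\mathrm{Diag}(\pi)^{-1}\,e_{X_{t'+1}}e_{X_{t'}}^\top\,\mathrm{Diag}(\pi)^{-1/2}u\right)\right],$$

which simplifies to

$$\frac{a-1}{a^2}\,u^\top E\left(\sum_{t\in H_s}\mathrm{Diag}(\pi)^{-1/2} e_{X_t}e_{X_{t+1}}^\top\,\mathrm{Diag}(\pi)^{-1}\,e_{X_{t+1}}e_{X_t}^\top\,\mathrm{Diag}(\pi)^{-1/2}\right)u.$$

The expectation is the same as that for the first term, Eq. (23).

Finally, the spectral norm of the third term, Eq. (25), is bounded using Eq. (22):

$$\left\|\mathrm{Diag}(\pi)^{-1/2}\,E(Y_s)\,\mathrm{Diag}(\pi)^{-1/2}\right\|^2 \le 4.$$

Therefore, by the triangle inequality, the bound $\pi^{(H_s)}_i/\pi_i \le 2$ from Eq. (21), and simplifications,

$$\left\|E(Z_sZ_s^\top)\right\| \le \max_{i\in[d]}\left(\sum_{j=1}^d\frac{P_{i,j}}{\pi_j}\right)\frac{\pi^{(H_s)}_i}{\pi_i} + 4 \le 2\max_{i\in[d]}\left(\sum_{j=1}^d\frac{P_{i,j}}{\pi_j}\right) + 4.$$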

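We can bound $E(Z_s^\top Z_s)$ in a similar way; the only difference is that the reversibility needs to be used at
one place to simplify an expectation:

$$\begin{aligned}
\frac{1}{a^2}\sum_{t\in H_s} E\left(\mathrm{Diag}(\pi)^{-1/2} e_{X_{t+1}}e_{X_t}^\top\,\mathrm{Diag}(\pi)^{-1}\,e_{X_t}e_{X_{t+1}}^\top\,\mathrm{Diag}(\pi)^{-1/2}\right)
&= \frac{1}{a^2}\sum_{t\in H_s}\sum_{i=1}^d\sum_{j=1}^d \Pr(X_t = i, X_{t+1} = j)\cdot\frac{1}{\pi_i\pi_j}\,e_je_j^\top \\
&= \frac{1}{a^2}\sum_{t\in H_s}\sum_{j=1}^d\left(\sum_{i=1}^d \pi^{(t)}_iP_{i,j}\cdot\frac{1}{\pi_i\pi_j}\right)e_je_j^\top \\
&= \frac{1}{a^2}\sum_{t\in H_s}\sum_{j=1}^d\left(\sum_{i=1}^d\frac{\pi^{(t)}_i}{\pi_i}\cdot\frac{P_{j,i}}{\pi_i}\right)e_je_j^\top,
\end{aligned}$$

where the last step uses Eq. (3). As before, we get

$$\left\|E(Z_s^\top Z_s)\right\| \le \max_{i\in[d]}\left(\sum_{j=1}^d\frac{P_{i,j}}{\pi_j}\cdot\frac{\pi^{(H_s)}_j}{\pi_j}\right) + 4 \le 2\max_{i\in[d]}\left(\sum_{j=1}^d\frac{P_{i,j}}{\pi_j}\right) + 4,$$

again using the bound $\pi^{(H_s)}_i/\pi_i \le 2$ from Eq. (21).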


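Independent copies bound. Let $\tilde Z_s$ for $s \in [\mu_H]$ be independent copies of $Z_s$ for $s \in [\mu_H]$. Applying
Theorem 8 to the average of these random matrices, we have

$$P\left(\left\|\frac{1}{\mu_H}\sum_{s=1}^{\mu_H}\tilde Z_s\right\| > \sqrt{\frac{4(d_P + 2)\ln\frac{4d}{\delta}}{\mu_H}} + \frac{2\left(\frac{1}{\pi_\star} + 2\right)\ln\frac{4d}{\delta}}{3\mu_H}\right) \le \delta \tag{26}$$

where

$$d_P := \max_{i\in[d]}\sum_{j=1}^d\frac{P_{i,j}}{\pi_j} \le \frac{1}{\pi_\star}.$$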

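The actual bound. To bound the probability that $\left\|\sum_{s=1}^{\mu_H} Z_s/\mu_H\right\|$ is large, we appeal to the following
result, which is a consequence of [45, Corollary 2.7]. For each $s \in [\mu_H]$, let
$X^{(H_s)} := (X_t : a' + 2(s-1)a + 1 \le t \le a' + (2s-1)a + 1)$, which are the random variables determining $Z_s$. Let $\mathbb{P}$ denote the joint
distribution of $(X^{(H_s)} : s \in [\mu_H])$; let $\mathbb{P}_s$ be its marginal over $X^{(H_s)}$, and let $\mathbb{P}_{1:s+1}$ be its marginal over
$(X^{(H_1)}, X^{(H_2)}, \dots, X^{(H_{s+1})})$. Let $\tilde{\mathbb{P}}$ be the product distribution formed from the marginals $\mathbb{P}_1, \mathbb{P}_2, \dots, \mathbb{P}_{\mu_H}$,
so $\tilde{\mathbb{P}}$ governs the joint distribution of $(\tilde Z_s : s \in [\mu_H])$. The result from [45, Corollary 2.7] implies for any
event $A$,

$$|\mathbb{P}(A) - \tilde{\mathbb{P}}(A)| \le (\mu_H - 1)\,\beta(\mathbb{P})$$

where

$$\beta(\mathbb{P}) := \max_{1\le s\le\mu_H-1} E\left(\left\|\mathbb{P}_{1:s+1}(\,\cdot \mid X^{(H_1)}, X^{(H_2)}, \dots, X^{(H_s)}) - \mathbb{P}_{s+1}\right\|_{\mathrm{tv}}\right).$$

Here, $\|\cdot\|_{\mathrm{tv}}$ denotes the total variation norm. The number $\beta(\mathbb{P})$ can be recognized to be the $\beta$-mixing
coefficient of the stochastic process $\{X^{(H_s)}\}_{s\in[\mu_H]}$. This result implies that the bound from Eq. (26) for
$\left\|\sum_{s=1}^{\mu_H}\tilde Z_s/\mu_H\right\|$ also holds for $\left\|\sum_{s=1}^{\mu_H} Z_s/\mu_H\right\|$, except the probability bound increases from $\delta$ to $\delta + (\mu_H - 1)\beta(\mathbb{P})$:

$$P\left(\left\|\frac{1}{\mu_H}\sum_{s=1}^{\mu_H} Z_s\right\| > \sqrt{\frac{4(d_P + 2)\ln\frac{4d}{\delta}}{\mu_H}} + \frac{2\left(\frac{1}{\pi_\star} + 2\right)\ln\frac{4d}{\delta}}{3\mu_H}\right) \le \delta + (\mu_H - 1)\beta(\mathbb{P}). \tag{27}$$

By the triangle inequality,

$$\beta(\mathbb{P}) \le \max_{1\le s\le\mu_H-1} E\left(\left\|\mathbb{P}_{1:s+1}(\,\cdot \mid X^{(H_1)}, X^{(H_2)}, \dots, X^{(H_s)}) - \mathbb{P}^\pi\right\|_{\mathrm{tv}} + \left\|\mathbb{P}_{s+1} - \mathbb{P}^\pi\right\|_{\mathrm{tv}}\right),$$

where $\mathbb{P}^\pi$ is the marginal distribution of $X^{(H_1)}$ under the stationary chain. Using the Markov property and
integrating out $X_t$ for $t > \min H_{s+1} = a' + 2sa + 1$,

$$\left\|\mathbb{P}_{1:s+1}(\,\cdot \mid X^{(H_1)}, X^{(H_2)}, \dots, X^{(H_s)}) - \mathbb{P}^\pi\right\|_{\mathrm{tv}} = \left\|\mathcal{L}(X_{a'+2sa+1} \mid X_{a'+(2s-1)a+1}) - \pi\right\|_{\mathrm{tv}},$$

where $\mathcal{L}(Y \mid Z)$ denotes the conditional distribution of $Y$ given $Z$. We bound this distance using standard
arguments for bounding the mixing time in terms of the relaxation time $1/\gamma_\star$ [see, e.g., the proof of Theorem
12.3 of 25]: for any $i \in [d]$,

$$\left\|\mathcal{L}(X_{a'+2sa+1} \mid X_{a'+(2s-1)a+1} = i) - \pi\right\|_{\mathrm{tv}} = \left\|\mathcal{L}(X_{a+1} \mid X_1 = i) - \pi\right\|_{\mathrm{tv}} \le \frac{\exp(-a\gamma_\star)}{\pi_\star}.$$

The distance $\|\mathbb{P}_{s+1} - \mathbb{P}^\pi\|_{\mathrm{tv}}$ can be bounded similarly:

$$\begin{aligned}
\left\|\mathbb{P}_{s+1} - \mathbb{P}^\pi\right\|_{\mathrm{tv}} &= \left\|\mathcal{L}(X_{a'+2sa+1}) - \pi\right\|_{\mathrm{tv}} \\
&= \left\|\sum_{i=1}^d P(X_1 = i)\,\mathcal{L}(X_{a'+2sa+1} \mid X_1 = i) - \pi\right\|_{\mathrm{tv}} \\
&\le \sum_{i=1}^d P(X_1 = i)\left\|\mathcal{L}(X_{a'+2sa+1} \mid X_1 = i) - \pi\right\|_{\mathrm{tv}} \\
&\le \frac{\exp(-(a' + 2sa)\gamma_\star)}{\pi_\star} \le \frac{\exp(-a\gamma_\star)}{\pi_\star}.
\end{aligned}$$

We conclude

$$(\mu_H - 1)\,\beta(\mathbb{P}) \le (\mu_H - 1)\,\frac{2\exp(-a\gamma_\star)}{\pi_\star} \le \frac{2(n-2)\exp(-a\gamma_\star)}{\pi_\star} \le \delta,$$

where the last step follows from the block length assumption Eq. (19).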

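We return to the decomposition from Eq. (17). We apply Eq. (27) to both the $H_s$ blocks and the $T_s$
blocks, and combine with Eq. (18) to obtain the following probabilistic bound. Pick any $\delta \in (0,1)$, let the
block length be

$$a := \lceil a_\delta\rceil = \left\lceil\frac{1}{\gamma_\star}\ln\frac{2(n-2)}{\pi_\star\delta}\right\rceil,$$

so

$$\min\{\mu_H, \mu_T\} = \left\lfloor\frac{n - 1 - a'}{2a}\right\rfloor \ge \frac{n-1}{2\left(1 + \frac{1}{\gamma_\star}\ln\frac{2(n-2)}{\pi_\star\delta}\right)} - 2 =: \mu.$$

If

$$n \ge 7 + \frac{6}{\gamma_\star}\ln\frac{2(n-2)}{\pi_\star\delta} \ge 3a, \tag{28}$$

then with probability at least $1 - 4\delta$,

$$\left\|\mathrm{Diag}(\pi)^{-1/2}\left(\hat M - E[\hat M]\right)\mathrm{Diag}(\pi)^{-1/2}\right\| \le \frac{4\left\lceil\frac{1}{\gamma_\star}\ln\frac{2(n-2)}{\pi_\star\delta}\right\rceil}{\pi_\star(n-1)} + \sqrt{\frac{4(d_P + 2)\ln\frac{4d}{\delta}}{\mu}} + \frac{2\left(\frac{1}{\pi_\star} + 2\right)\ln\frac{4d}{\delta}}{3\mu}.$$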

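B.4.3 The bound on $\|\mathcal{E}_M\|$

Combining the probabilistic bound from above with the bound on the bias from Eq. (15), we obtain the
following. Assuming the condition on $n$ from Eq. (28), with probability at least $1 - 4\delta$,

$$\|\mathcal{E}_M\| \le \frac{1}{(n-1)\gamma_\star\pi_\star} + \frac{4\left\lceil\frac{1}{\gamma_\star}\ln\frac{2(n-2)}{\pi_\star\delta}\right\rceil}{\pi_\star(n-1)} + \sqrt{\frac{4(d_P + 2)\ln\frac{4d}{\delta}}{\mu}} + \frac{2\left(\frac{1}{\pi_\star} + 2\right)\ln\frac{4d}{\delta}}{3\mu}. \tag{29}$$

B.5 Overall error bound

Assume the sequence length $n$ satisfies Eq. (10) and Eq. (28). Consider the $1 - 5\delta$ probability event in which
Eqs. (7), (8) and (29) hold. In this event, Eq. (11) also holds, and hence by Eq. (9),

$$\begin{aligned}
\|\hat L - L\| &\le 3\left(\sqrt{\frac{8\ln\left(\frac{d}{\delta}\sqrt{\frac{2}{\pi_\star}}\right)}{\pi_\star\gamma_\star n}} + \frac{20\ln\left(\frac{d}{\delta}\sqrt{\frac{2}{\pi_\star}}\right)}{\pi_\star\gamma_\star n}\right) \\
&\quad + 3\left(\frac{1}{(n-1)\gamma_\star\pi_\star} + \frac{4\left\lceil\frac{1}{\gamma_\star}\ln\frac{2(n-2)}{\pi_\star\delta}\right\rceil}{\pi_\star(n-1)} + \sqrt{\frac{4(d_P + 2)\ln\frac{4d}{\delta}}{\mu}} + \frac{2\left(\frac{1}{\pi_\star} + 2\right)\ln\frac{4d}{\delta}}{3\mu}\right) \\
&= O\left(\sqrt{\frac{\log\left(\frac{d}{\delta}\right)\log\left(\frac{n}{\pi_\star\delta}\right)}{\pi_\star\gamma_\star n}}\right).
\end{aligned}$$

To finish the proof of Lemma 2, we replace $\delta$ with $\delta/5$, and now observe that the bound on $\|\hat L - L\|$
is trivial if Eq. (10) is violated (recalling that $\|\hat L - L\| \le 2$ always holds). We tack on an additional term
$\log(1/(\pi_\star\gamma_\star\delta))/(\gamma_\star n)$ to also ensure a trivial bound if Eq. (28) is violated. □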


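C Proof of Theorem 4

In this section, we derive Algorithm 1 and prove Theorem 4.

C.1 Estimators for $\pi$ and $\gamma_\star$

The algorithm forms the estimator $\hat P$ of $P$ using Laplace smoothing:

$$\hat P_{i,j} := \frac{N_{i,j} + \alpha}{N_i + d\alpha}$$

where

$$N_{i,j} := |\{t \in [n-1] : (X_t, X_{t+1}) = (i,j)\}|, \qquad N_i := |\{t \in [n-1] : X_t = i\}|$$

and $\alpha > 0$ is a positive constant, which we set beforehand as $\alpha := 1/d$ for simplicity.

As a result of the smoothing, all entries of $\hat P$ are positive, and hence $\hat P$ is a transition probability matrix
for an ergodic Markov chain. We let $\hat\pi$ be the unique stationary distribution for $\hat P$. Using $\hat\pi$, we form an
estimator $\mathrm{Sym}(\hat L)$ of $L$ using:

$$\mathrm{Sym}(\hat L) := \frac{1}{2}\left(\hat L + \hat L^\top\right), \qquad \hat L := \mathrm{Diag}(\hat\pi)^{1/2}\,\hat P\,\mathrm{Diag}(\hat\pi)^{-1/2}.$$

Let $\hat\lambda_1 \ge \hat\lambda_2 \ge \cdots \ge \hat\lambda_d$ be the eigenvalues of $\mathrm{Sym}(\hat L)$ (and in fact, we have $1 = \hat\lambda_1 > \hat\lambda_2$ and $\hat\lambda_d > -1$). The
algorithm estimates the spectral gap $\gamma_\star$ using

$$\hat\gamma_\star := 1 - \max\{\hat\lambda_2, |\hat\lambda_d|\}.$$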
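The estimators of this subsection can be sketched in a few lines of dependency-free code. This is an illustration under stated assumptions, not the authors' implementation: states are assumed to be labeled `0..d-1`, the stationary distribution and the largest-magnitude non-trivial eigenvalue are both approximated by plain power iteration (a swapped-in numerical technique with a fixed, hypothetical iteration budget `iters`), and the function name is made up.

```python
import math


def estimate_spectral_gap(path, d, alpha=None, iters=2000):
    """Return (P_hat, pi_hat, gap_hat) from a single sample path over states 0..d-1."""
    alpha = 1.0 / d if alpha is None else alpha
    # Doublet counts N[i][j]; the row sums give N_i.
    counts = [[0] * d for _ in range(d)]
    for x, y in zip(path, path[1:]):
        counts[x][y] += 1
    # Laplace-smoothed transition matrix.
    P = [[(counts[i][j] + alpha) / (sum(counts[i]) + d * alpha) for j in range(d)]
         for i in range(d)]
    # Stationary distribution of P_hat by power iteration (all entries of P_hat are positive).
    pi = [1.0 / d] * d
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(d)) for j in range(d)]
    total = sum(pi)
    pi = [p / total for p in pi]
    root = [math.sqrt(p) for p in pi]
    # L_hat = Diag(pi)^{1/2} P_hat Diag(pi)^{-1/2}; M is Sym(L_hat) with the top
    # eigenpair (1, sqrt(pi)) removed, so its spectral radius is max{lambda_2, |lambda_d|}.
    L = [[root[i] * P[i][j] / root[j] for j in range(d)] for i in range(d)]
    M = [[0.5 * (L[i][j] + L[j][i]) - root[i] * root[j] for j in range(d)]
         for i in range(d)]
    v = [math.sin(i + 1.0) for i in range(d)]  # arbitrary fixed starting vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm_w = math.sqrt(sum(x * x for x in w))
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_w == 0.0:
            lam = 0.0
            break
        lam = norm_w / norm_v
        v = [x / norm_w for x in w]
    return P, pi, 1.0 - lam


P_hat, pi_hat, gap_hat = estimate_spectral_gap([0, 1, 0, 1, 0, 1, 0, 1, 0], d=2)
print(P_hat, pi_hat, gap_hat)
```

In practice one would likely use a symmetric eigensolver from a linear algebra library instead of power iteration; the version above only aims to make the sequence of steps explicit.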


C.2 Empirical bounds for $P$

We make use of a simple corollary of Freedman's inequality for martingales [13, Theorem 1.6].

Theorem 9 (Freedman's inequality). Let $(Y_t)_{t\in\mathbb{N}}$ be a bounded martingale difference sequence with respect
to the filtration $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$; assume for some $b > 0$, $|Y_t| \le b$ almost surely for all $t \in \mathbb{N}$. Let
$V_k := \sum_{t=1}^k E(Y_t^2 \mid \mathcal{F}_{t-1})$ and $S_k := \sum_{t=1}^k Y_t$ for $k \in \mathbb{N}$. For all $s, v > 0$,

$$\Pr\left[\exists k \in \mathbb{N} \text{ s.t. } S_k > s \wedge V_k \le v\right] \le \left(\frac{v/b^2}{s/b + v/b^2}\right)^{s/b + v/b^2} e^{s/b} = \exp\left(-\frac{v}{b^2}\,h\!\left(\frac{bs}{v}\right)\right),$$

where $h(u) := (1+u)\ln(1+u) - u$.

Observe that in Theorem 9, for any $x > 0$, if $s := \sqrt{2vx} + bx/3$ and $z := b^2x/v$, then the probability
bound on the right-hand side becomes

$$\exp\left(-x\cdot\frac{h\left(\sqrt{2z} + z/3\right)}{z}\right) \le e^{-x}$$

since $h(\sqrt{2z} + z/3)/z \ge 1$ for all $z > 0$ (see, e.g., [2, proof of Lemma 5]).
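The inequality $h(\sqrt{2z} + z/3)/z \ge 1$ is the only analytic fact needed to turn Freedman's bound into the familiar $e^{-x}$ form. A quick numerical scan, offered purely as an illustration (it is not a proof, and the function names are hypothetical), looks as follows.

```python
import math


def h(u):
    """h(u) = (1 + u) ln(1 + u) - u, using log1p for accuracy at small u."""
    return (1 + u) * math.log1p(u) - u


def exponent_ratio(z):
    """The ratio h(sqrt(2 z) + z / 3) / z that must be at least 1."""
    return h(math.sqrt(2 * z) + z / 3) / z


# Scan z on a logarithmic grid from 1e-3 to 1e3.
ratios = [exponent_ratio(10 ** (k / 10)) for k in range(-30, 31)]
print("smallest ratio on the grid:", min(ratios))
```

The ratio approaches 1 from above as $z \to 0$, which shows that the constant 1/3 in the choice of $s$ leaves essentially no slack for small $z$.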

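Corollary 1. Under the same setting as Theorem 9, for any $n \ge 1$, $x > 0$, and $c > 1$,

$$\Pr\left[\exists k \in [n] \text{ s.t. } S_k > \sqrt{2cV_kx} + 4bx/3\right] \le \left(1 + \lceil\log_c(2n/x)\rceil_+\right)e^{-x}.$$

Proof. Define $v_i := c^ib^2x/2$ for $i = 0, 1, 2, \dots, \lceil\log_c(2n/x)\rceil_+$, and let $v_{-1} := -\infty$. Then, since $V_k \in [0, b^2n]$
for all $k \in [n]$,

$$\begin{aligned}
&\Pr\left[\exists k \in [n] \text{ s.t. } S_k > \sqrt{2\max\{v_0, cV_k\}x} + bx/3\right] \\
&\quad\le \sum_{i=0}^{\lceil\log_c(2n/x)\rceil_+}\Pr\left[\exists k \in [n] \text{ s.t. } S_k > \sqrt{2\max\{v_0, cV_k\}x} + bx/3 \wedge v_{i-1} < V_k \le v_i\right] \\
&\quad\le \sum_{i=0}^{\lceil\log_c(2n/x)\rceil_+}\Pr\left[\exists k \in [n] \text{ s.t. } S_k > \sqrt{2\max\{v_0, cv_{i-1}\}x} + bx/3 \wedge v_{i-1} < V_k \le v_i\right] \\
&\quad\le \sum_{i=0}^{\lceil\log_c(2n/x)\rceil_+}\Pr\left[\exists k \in [n] \text{ s.t. } S_k > \sqrt{2v_ix} + bx/3 \wedge V_k \le v_i\right] \\
&\quad\le \left(1 + \lceil\log_c(2n/x)\rceil_+\right)e^{-x},
\end{aligned}$$

where the final inequality uses Theorem 9. The conclusion now follows because

$$\sqrt{2cV_kx} + 4bx/3 \ge \sqrt{2\max\{v_0, cV_k\}x} + bx/3$$

for all $k \in [n]$. □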

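Lemma 3. The following holds for any constant $c > 1$ with probability at least $1 - \delta$: for all $(i,j) \in [d]^2$,

$$|\hat P_{i,j} - P_{i,j}| \le \sqrt{\frac{2cN_iP_{i,j}(1 - P_{i,j})\tau_{n,\delta}}{(N_i + d\alpha)^2}} + \frac{(4/3)\tau_{n,\delta}}{N_i + d\alpha} + \frac{|\alpha - d\alpha P_{i,j}|}{N_i + d\alpha}, \tag{30}$$

where

$$\tau_{n,\delta} := \inf\left\{t \ge 0 : 2d^2\left(1 + \lceil\log_c(2n/t)\rceil_+\right)e^{-t} \le \delta\right\} = O\left(\log\left(\frac{d\log n}{\delta}\right)\right). \tag{31}$$

Proof. Let $\mathcal{F}_t$ be the $\sigma$-field generated by $X_1, X_2, \dots, X_t$. Fix a pair $(i,j) \in [d]^2$. Let $Y_1 := 0$, and for $t \ge 2$,

$$Y_t := \mathbb{1}\{X_{t-1} = i\}\left(\mathbb{1}\{X_t = j\} - P_{i,j}\right),$$

so that

$$\sum_{t=1}^n Y_t = N_{i,j} - N_iP_{i,j}.$$

The Markov property implies that the stochastic process $(Y_t)_{t\in[n]}$ is an $(\mathcal{F}_t)$-adapted martingale difference
sequence: $Y_t$ is $\mathcal{F}_t$-measurable and $E(Y_t \mid \mathcal{F}_{t-1}) = 0$, for each $t$. Moreover, for all $t \in [n]$,

$$Y_t \in [-P_{i,j},\, 1 - P_{i,j}],$$

and for $t \ge 2$,

$$E(Y_t^2 \mid \mathcal{F}_{t-1}) = \mathbb{1}\{X_{t-1} = i\}\,P_{i,j}(1 - P_{i,j}).$$

Therefore, by Corollary 1 and union bounds, we have

$$|N_{i,j} - N_iP_{i,j}| \le \sqrt{2cN_iP_{i,j}(1 - P_{i,j})\tau_{n,\delta}} + \frac{4\tau_{n,\delta}}{3}$$

for all $(i,j) \in [d]^2$. □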
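The quantity $\tau_{n,\delta}$ in Eq. (31) has no closed form, but the defining condition is monotone in $t$, so it can be located by bisection. The sketch below assumes that definition, reads $\lceil\cdot\rceil_+$ as $\max\{0, \lceil\cdot\rceil\}$, and uses hypothetical function names and a fixed number of bisection steps.

```python
import math


def tail_factor(t, n, d, c):
    """Left-hand side 2 d^2 (1 + ceil_+(log_c(2n / t))) e^{-t} of the condition in Eq. (31)."""
    levels = max(0, math.ceil(math.log(2 * n / t, c)))
    return 2 * d * d * (1 + levels) * math.exp(-t)


def tau(n, d, delta, c=1.1):
    """Approximate the smallest t >= 0 with tail_factor(t) <= delta by bisection."""
    lo, hi = 1e-9, 1.0
    while tail_factor(hi, n, d, c) > delta:  # grow hi until the condition holds
        hi *= 2
    for _ in range(200):  # tail_factor is non-increasing in t, so bisection applies
        mid = 0.5 * (lo + hi)
        if tail_factor(mid, n, d, c) <= delta:
            hi = mid
        else:
            lo = mid
    return hi


t_value = tau(n=1000, d=5, delta=0.05)
print("tau for n=1000, d=5, delta=0.05:", t_value)
```

The result grows only like $\log(d\log(n)/\delta)$, so even very long sample paths yield single- or low-double-digit values.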


Equation (30) can be viewed as constraints on the possible value that $P_{i,j}$ may have (with high probability).
Since $P_{i,j}$ is the only unobserved quantity in the bound from Eq. (30), we can numerically maximize
$|\hat P_{i,j} - P_{i,j}|$ subject to the constraint in Eq. (30) (viewing $P_{i,j}$ as the optimization variable). Let $B^*_{i,j}$ be this
maximum value, so we have

$$P_{i,j} \in \left[\hat P_{i,j} - B^*_{i,j},\; \hat P_{i,j} + B^*_{i,j}\right]$$

in the same event where Eq. (30) holds.

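In the algorithm, we give a simple alternative to computing $B^*_{i,j}$ that avoids numerical optimization,
derived in the spirit of empirical Bernstein bounds [2]. Specifically, with $c := 1.1$ (an arbitrary choice), we
compute

$$\hat B_{i,j} := \left(\sqrt{\frac{c\tau_{n,\delta}}{2N_i}} + \sqrt{\frac{c\tau_{n,\delta}}{2N_i} + \sqrt{\frac{2c\hat P_{i,j}(1 - \hat P_{i,j})\tau_{n,\delta}}{N_i}} + \frac{(4/3)\tau_{n,\delta} + |\alpha - d\alpha\hat P_{i,j}|}{N_i}}\right)^2 \tag{32}$$

for each $(i,j) \in [d]^2$, where $\tau_{n,\delta}$ is defined in Eq. (31). We show in Lemma 4 that

$$P_{i,j} \in \left[\hat P_{i,j} - \hat B_{i,j},\; \hat P_{i,j} + \hat B_{i,j}\right],$$

again, in the same event where Eq. (30) holds. The observable bound in Eq. (32) is not too far from the
unobservable bound in Eq. (30).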
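As an illustration of how an observable width of this kind can be computed, the sketch below takes the width to be the largest $x$ satisfying the relaxed inequality $x \le \sqrt{x}\,\sqrt{2c\tau/N_i} + \sqrt{2c\hat P(1-\hat P)\tau/N_i} + ((4/3)\tau + |\alpha - d\alpha\hat P|)/N_i$ that arises in the proof of Lemma 4. Treat that closed form, the default $\alpha = 1/d$, and the function name as assumptions of this sketch.

```python
import math


def empirical_width(n_i, p_hat, tau_value, d, alpha=None, c=1.1):
    """Largest x with x <= sqrt(x) * b + rest, i.e. (b/2 + sqrt(b^2/4 + rest))^2."""
    if n_i <= 0:
        raise ValueError("the bound is only derived for states visited at least once")
    alpha = 1.0 / d if alpha is None else alpha
    half_b_sq = c * tau_value / (2 * n_i)  # equals b^2 / 4 with b = sqrt(2 c tau / N_i)
    rest = (math.sqrt(2 * c * p_hat * (1 - p_hat) * tau_value / n_i)
            + ((4.0 / 3.0) * tau_value + abs(alpha - d * alpha * p_hat)) / n_i)
    return (math.sqrt(half_b_sq) + math.sqrt(half_b_sq + rest)) ** 2


# Made-up inputs: a state visited 200 (then 2000) times, with P_hat = 0.3.
w_small = empirical_width(n_i=200, p_hat=0.3, tau_value=11.0, d=5)
w_large = empirical_width(n_i=2000, p_hat=0.3, tau_value=11.0, d=5)
print(w_small, w_large)
```

The width shrinks roughly like $1/\sqrt{N_i}$ once $N_i$ is large relative to $\tau_{n,\delta}$, mirroring the leading term of the unobservable bound.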

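Lemma 4. In the same $1 - \delta$ event as from Lemma 3, we have $P_{i,j} \in [\hat P_{i,j} - \hat B_{i,j},\, \hat P_{i,j} + \hat B_{i,j}]$ for all
$(i,j) \in [d]^2$, where $\hat B_{i,j}$ is defined in Eq. (32).

Proof. Recall that in the $1 - \delta$ probability event from Lemma 3, we have for all $(i,j) \in [d]^2$,

$$|\hat P_{i,j} - P_{i,j}| = \left|\frac{N_{i,j} - N_iP_{i,j}}{N_i + d\alpha} + \frac{\alpha - d\alpha P_{i,j}}{N_i + d\alpha}\right| \le \sqrt{\frac{2cN_iP_{i,j}(1 - P_{i,j})\tau_{n,\delta}}{(N_i + d\alpha)^2}} + \frac{(4/3)\tau_{n,\delta}}{N_i + d\alpha} + \frac{|\alpha - d\alpha P_{i,j}|}{N_i + d\alpha}.$$

Applying the triangle inequality to the right-hand side, we obtain

$$|\hat P_{i,j} - P_{i,j}| \le \sqrt{\frac{2cN_i\left(\hat P_{i,j}(1 - \hat P_{i,j}) + |\hat P_{i,j} - P_{i,j}|\right)\tau_{n,\delta}}{(N_i + d\alpha)^2}} + \frac{(4/3)\tau_{n,\delta}}{N_i + d\alpha} + \frac{|\alpha - d\alpha\hat P_{i,j}| + d\alpha|\hat P_{i,j} - P_{i,j}|}{N_i + d\alpha}.$$

Since $\sqrt{A + B} \le \sqrt{A} + \sqrt{B}$ for non-negative $A, B$, we loosen the above inequality and rearrange it to obtain

$$\left(1 - \frac{d\alpha}{N_i + d\alpha}\right)|\hat P_{i,j} - P_{i,j}| \le \sqrt{|\hat P_{i,j} - P_{i,j}|}\cdot\sqrt{\frac{2cN_i\tau_{n,\delta}}{(N_i + d\alpha)^2}} + \sqrt{\frac{2cN_i\hat P_{i,j}(1 - \hat P_{i,j})\tau_{n,\delta}}{(N_i + d\alpha)^2}} + \frac{(4/3)\tau_{n,\delta} + |\alpha - d\alpha\hat P_{i,j}|}{N_i + d\alpha}.$$

Whenever $N_i > 0$, we can solve a quadratic inequality to conclude $|\hat P_{i,j} - P_{i,j}| \le \hat B_{i,j}$. □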


C.3 Empirical bounds for $\pi$

Recall that $\hat\pi$ is obtained as the unique stationary distribution for $\hat P$. Let $\hat A := I - \hat P$, and let $\hat A^\#$ be the
group inverse of $\hat A$, i.e., the unique square matrix satisfying the following equalities:

$$\hat A\hat A^\#\hat A = \hat A, \qquad \hat A^\#\hat A\hat A^\# = \hat A^\#, \qquad \hat A^\#\hat A = \hat A\hat A^\#.$$

The matrix $\hat A^\#$, which is well defined no matter what transition probability matrix $\hat P$ we start with [31], is
a central quantity that captures many properties of the ergodic Markov chain with transition matrix $\hat P$ [31].
We denote the $(i,j)$-th entry of $\hat A^\#$ by $\hat A^\#_{i,j}$. Define

$$\hat\kappa := \frac{1}{2}\max\left\{\hat A^\#_{j,j} - \min\left\{\hat A^\#_{i,j} : i \in [d]\right\} : j \in [d]\right\}.$$

Analogously define

$$A := I - P, \qquad A^\# := \text{group inverse of } A, \qquad \kappa := \frac{1}{2}\max\left\{A^\#_{j,j} - \min\left\{A^\#_{i,j} : i \in [d]\right\} : j \in [d]\right\}.$$

We now use the following perturbation bound from [10, Section 3.3] (derived from [18, 22]).

Lemma 5 ([18, 22]). If $|\hat P_{i,j} - P_{i,j}| \le \hat B_{i,j}$ for each $(i,j) \in [d]^2$, then

$$\max\{|\hat\pi_i - \pi_i| : i \in [d]\} \le \min\{\kappa, \hat\kappa\}\max\{\hat B_{i,j} : (i,j) \in [d]^2\} \le \hat\kappa\max\{\hat B_{i,j} : (i,j) \in [d]^2\}.$$

This establishes the validity of the confidence intervals for the $\pi_i$ in the same event from Lemma 3.
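For a concrete feel of the condition number in Lemma 5, the following sketch computes a group inverse and the associated $\kappa$ for a small chain. It relies on the standard identity $A^\# = (I - P + \mathbf{1}\pi^\top)^{-1} - \mathbf{1}\pi^\top$ for ergodic chains, which is an assumption of this sketch rather than something stated in the text, and uses a plain Gauss-Jordan inverse to stay dependency-free; all function names are hypothetical.

```python
def inverse(mat):
    """Gauss-Jordan inverse with partial pivoting for a small dense matrix."""
    n = len(mat)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def stationary(P):
    """Solve pi (I - P) = 0 with sum(pi) = 1 by replacing one equation with the normalization."""
    d = len(P)
    B = [[(1.0 if i == j else 0.0) - P[j][i] for j in range(d)] for i in range(d)]
    B[-1] = [1.0] * d
    B_inv = inverse(B)
    return [B_inv[i][-1] for i in range(d)]


def group_inverse(P):
    """Group inverse of A = I - P via (I - P + 1 pi^T)^{-1} - 1 pi^T."""
    d = len(P)
    pi = stationary(P)
    Z = inverse([[(1.0 if i == j else 0.0) - P[i][j] + pi[j] for j in range(d)]
                 for i in range(d)])
    return [[Z[i][j] - pi[j] for j in range(d)] for i in range(d)]


def kappa(P):
    """(1/2) max_j ( A#[j][j] - min_i A#[i][j] )."""
    G = group_inverse(P)
    d = len(P)
    return 0.5 * max(G[j][j] - min(G[i][j] for i in range(d)) for j in range(d))


P_example = [[0.7, 0.3], [0.1, 0.9]]
print("stationary distribution:", stationary(P_example))
print("kappa:", kappa(P_example))
```

For this two-state example the spectral gap is 0.4, and the computed value of $\kappa$ is well below $d/\gamma_\star = 5$.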

We now establish the validity of the bounds for the ratio quantities $\sqrt{\hat\pi_i/\pi_i}$ and $\sqrt{\pi_i/\hat\pi_i}$.

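Lemma 6. If $\max\{|\hat\pi_i - \pi_i| : i \in [d]\} \le \hat b$, then

$$\max\bigcup_{i\in[d]}\left\{\left|\sqrt{\pi_i/\hat\pi_i} - 1\right|,\; \left|\sqrt{\hat\pi_i/\pi_i} - 1\right|\right\} \le \frac{1}{2}\max\bigcup_{i\in[d]}\left\{\frac{\hat b}{\hat\pi_i},\; \frac{\hat b}{[\hat\pi_i - \hat b]_+}\right\}.$$

Proof. By Lemma 5, we have for each $i \in [d]$,

$$\frac{|\hat\pi_i - \pi_i|}{\hat\pi_i} \le \frac{\hat b}{\hat\pi_i}, \qquad \frac{|\hat\pi_i - \pi_i|}{\pi_i} \le \frac{\hat b}{\pi_i} \le \frac{\hat b}{[\hat\pi_i - \hat b]_+}.$$

Therefore, using the fact that for any $x > 0$,

$$\max\left\{|\sqrt{x} - 1|,\; |\sqrt{1/x} - 1|\right\} \le \frac{1}{2}\max\left\{|x - 1|,\; |1/x - 1|\right\},$$

we have for every $i \in [d]$,

$$\max\left\{\left|\sqrt{\pi_i/\hat\pi_i} - 1\right|,\; \left|\sqrt{\hat\pi_i/\pi_i} - 1\right|\right\} \le \frac{1}{2}\max\left\{\left|\pi_i/\hat\pi_i - 1\right|,\; \left|\hat\pi_i/\pi_i - 1\right|\right\} \le \frac{1}{2}\max\left\{\frac{\hat b}{\hat\pi_i},\; \frac{\hat b}{[\hat\pi_i - \hat b]_+}\right\}.$$

□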


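C.4 Empirical bounds for $L$

By Weyl's inequality and the triangle inequality,

$$\max_{i\in[d]}|\lambda_i - \hat\lambda_i| \le \|L - \mathrm{Sym}(\hat L)\| \le \|\hat L - L\|.$$

It is easy to show that $|\hat\gamma_\star - \gamma_\star|$ is bounded by the same quantity. Therefore, it remains to establish an
empirical bound on $\|\hat L - L\|$.

Lemma 7. If $|\hat P_{i,j} - P_{i,j}| \le \hat B_{i,j}$ for each $(i,j) \in [d]^2$ and $\max\{|\hat\pi_i - \pi_i| : i \in [d]\} \le \hat b$, then

$$\|\hat L - L\| \le 2\hat\rho + \hat\rho^2 + (1 + 2\hat\rho + \hat\rho^2)\left(\sum_{(i,j)\in[d]^2}\frac{\hat\pi_i}{\hat\pi_j}\hat B_{i,j}^2\right)^{1/2},$$

where

$$\hat\rho := \frac{1}{2}\max\bigcup_{i\in[d]}\left\{\frac{\hat b}{\hat\pi_i},\; \frac{\hat b}{[\hat\pi_i - \hat b]_+}\right\}.$$

Proof. We use the following decomposition of $L - \hat L$:

$$L - \hat L = \mathcal{E}_P + \mathcal{E}_{\pi,1}\hat L + \hat L\mathcal{E}_{\pi,2} + \mathcal{E}_{\pi,1}\mathcal{E}_P + \mathcal{E}_P\mathcal{E}_{\pi,2} + \mathcal{E}_{\pi,1}\hat L\mathcal{E}_{\pi,2} + \mathcal{E}_{\pi,1}\mathcal{E}_P\mathcal{E}_{\pi,2}$$

where

$$\begin{aligned}
\mathcal{E}_P &:= \mathrm{Diag}(\hat\pi)^{1/2}(P - \hat P)\,\mathrm{Diag}(\hat\pi)^{-1/2}, \\
\mathcal{E}_{\pi,1} &:= \mathrm{Diag}(\pi)^{1/2}\,\mathrm{Diag}(\hat\pi)^{-1/2} - I, \\
\mathcal{E}_{\pi,2} &:= \mathrm{Diag}(\hat\pi)^{1/2}\,\mathrm{Diag}(\pi)^{-1/2} - I.
\end{aligned}$$

Therefore

$$\|\hat L - L\| \le \|\mathcal{E}_{\pi,1}\| + \|\mathcal{E}_{\pi,2}\| + \|\mathcal{E}_{\pi,1}\|\|\mathcal{E}_{\pi,2}\| + \left(1 + \|\mathcal{E}_{\pi,1}\| + \|\mathcal{E}_{\pi,2}\| + \|\mathcal{E}_{\pi,1}\|\|\mathcal{E}_{\pi,2}\|\right)\|\mathcal{E}_P\|.$$

Observe that for each $(i,j) \in [d]^2$, the $(i,j)$-th entry of $\mathcal{E}_P$ is bounded in absolute value by

$$|(\mathcal{E}_P)_{i,j}| = \hat\pi_i^{1/2}\hat\pi_j^{-1/2}\,|P_{i,j} - \hat P_{i,j}| \le \hat\pi_i^{1/2}\hat\pi_j^{-1/2}\,\hat B_{i,j}.$$

Since the spectral norm of $\mathcal{E}_P$ is bounded above by its Frobenius norm,

$$\|\mathcal{E}_P\| \le \left(\sum_{(i,j)\in[d]^2}(\mathcal{E}_P)_{i,j}^2\right)^{1/2} \le \left(\sum_{(i,j)\in[d]^2}\frac{\hat\pi_i}{\hat\pi_j}\hat B_{i,j}^2\right)^{1/2}.$$

Finally, the spectral norms of $\mathcal{E}_{\pi,1}$ and $\mathcal{E}_{\pi,2}$ satisfy

$$\max\{\|\mathcal{E}_{\pi,1}\|, \|\mathcal{E}_{\pi,2}\|\} = \max\bigcup_{i\in[d]}\left\{\left|\sqrt{\pi_i/\hat\pi_i} - 1\right|,\; \left|\sqrt{\hat\pi_i/\pi_i} - 1\right|\right\},$$

which can be bounded using Lemma 6. □

This establishes the validity of the confidence interval for $\gamma_\star$ in the same event from Lemma 3.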

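C.5 Asymptotic widths of intervals

Let us now turn to the asymptotic behavior of the interval widths (regarding $\hat b$, $\hat\rho$, and $\hat w$ all as functions of
$n$).

A simple calculation gives that, almost surely, as $n \to \infty$,

$$\sqrt{\frac{n}{\log\log n}}\;\hat b = O\left(\max_{i,j}\,\kappa\sqrt{\frac{P_{i,j}(1 - P_{i,j})}{\pi_i}}\right), \qquad \sqrt{\frac{n}{\log\log n}}\;\hat\rho = O\left(\frac{\kappa}{\pi_\star^{3/2}}\right).$$

Here, we use the fact that $\hat\kappa \to \kappa$ as $n \to \infty$ since $\hat A^\# \to A^\#$ as $\hat P \to P$ [6, 27].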

Further, since 


1/2 


log log n TTj 


Em 


TTi Pij{l — Pij) 


l/2\ 


= 0 K/ — 


we thus have 


n ~ ^ « 

w = O 


log log n 


3/2 


This completes the proof of Theorem 4 . 

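The following claim provides a bound on $\kappa$ in terms of the number of states and the spectral gap.
Claim 1. $\kappa \le d/\gamma_\star$.

Proof. It is established by Cho and Meyer [10] that

$$\kappa \le \max_{i,j}|A^\#_{i,j}| \le \sup_{\|v\|_1 = 1,\ \langle v, \mathbf{1}\rangle = 0}\|v^\top A^\#\|_1$$

(our $\kappa$ is the $\kappa_4$ quantity from [10]), and Seneta [38] establishes

$$\sup_{\|v\|_1 = 1,\ \langle v, \mathbf{1}\rangle = 0}\|v^\top A^\#\|_1 \le \frac{d}{\gamma_\star}.$$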

□ 


□

D Proof of Theorem 5


Let $\hat\pi_{\star,\mathrm{lb}}$ and $\hat\gamma_{\star,\mathrm{lb}}$ be the lower bounds on $\pi_\star$ and $\gamma_\star$, respectively, computed from Algorithm 1. Let $\hat\pi_\star$ and
$\hat\gamma_\star$ be the estimates of $\pi_\star$ and $\gamma_\star$ computed using the estimators from Theorem 3. By a union bound, we
have by Theorem 3 and Theorem 4 that with probability at least $1 - 2\delta$,


1'^* - 7r*| < C 


I TT* log ^ c log ^ . 


7ir,lhn 


7 *,lb^ 


(33) 


and 


I 7 * - 7*1 < c 


/logf - log-- 


,lbO 


log f ■ log : 


log; 




’7f*,lb7^,lb^ 


7 ^,lb^ 


(34) 


The bound on $|\hat\gamma_\star - \gamma_\star|$ in Eq. (34), call it $\hat w'$, is fully observable and hence yields a confidence interval for
$\gamma_\star$. The bound on $|\hat\pi_\star - \pi_\star|$ in Eq. (33) depends on $\pi_\star$, but from it one can derive




iTT^logj- 


,lbO 


log 


d 

■S't.lb'? 


7*,ibn 


7*,ibn 


using the approach from the proof of Lemma 4. Here, $C' > 0$ is an absolute constant that depends only on
$C$. This bound, call it $\hat b'$, is now also fully observable. We have established that in the $1 - 2\delta$ probability
event from above,

TT* e t/ := [tt* - 6',7r* + S'], 7* e E := [7^ - w',% + w']. 

It is easy to see that almost surely (as n —>■ 00 ), 



log{d/d) 


7r*7* 


and 



This completes the proof of Theorem 5. 


□